ABSTRACT


Two processes, i.e., relevance feedback and retrieval on 
clustered files, are modelled and analysed. Experiments are 
also conducted to verify part of the theoretical results. 
For each individual process, the behavior of the system 
performance is studied under the variation of the key 
parameters of the process. Together, the processes lend 
themselves as examples for studying modelling and analytic
techniques for evaluating information retrieval processes.
CHAPTER 1


INTRODUCTION 


1.1 Problem Area


This thesis addresses itself to the problem of the 
analysis of processes in information retrieval. Two impor- 
tant processes, namely, relevance feedback (RF) and retrieval 
in clustered files (RCF) are selected as candidates for 
detailed investigation. The purpose is actually two-fold. 
As a practical, short-range goal, the analysis will reveal
the intrinsic relationships among various key parameters of 
the processes, indicate regions in the parameter space which 
guarantee good results, and, in some cases, derive optimal 
values for the parameters. These analytical results should 
prove useful to those designers of information systems who 
wish to adopt these processes. On the other hand, the 
modelling, as well as the analytic techniques will hopefully 
serve as valuable examples to others with similar research 
interest. Two different models are constructed for the 
processes. The model used for RF is developed from Swets' 
continuous model, in which the items and attributes are 
invisible; it is by and large a "macroscopic" model. For
RCF, however, a discrete and "microscopic" model is used
which heavily depends on the occurrences of each attribute.
Chapter 4 will be devoted to further explaining details of the
two models and contrasting one with the other.


1.2 Information Systems


As a branch subject of computer science, information 
retrieval is not very well defined; in fact, a large part
of non-numeric computing activities can be classified as
information storing and retrieving. It is therefore appropriate
to first define, before proceeding with the main body
of this thesis, the kind of information systems on which the 
subsequent discussions are based. 

The major components of an information retrieval 
system are depicted in Fig. 1. Contained in the data base 
are a set of records, each of which represents an item in 
the "information base", where the ultimate information needed 
by the users is stored. The basic unit of an information 
base is an item, which can be a document, a personnel record, a
description of an auto part or an antique in a museum, etc.
Corresponding to each item there is a record in the data base
which consists of a set of attributes, chosen to represent
the item. Through the process of indexing, the information 
base is converted into a data base which is structured for 
computer searching. 

To describe the functioning of the system, we start 
with the user who requests information. This request is
often expressed as a query which, like a record in the data
base, also consists of a number of attributes. The query is
then submitted to the system, which matches it with each
record in the whole data base (or a subset of it). In the type
of system discussed here, the best-match method is used and only
those records which are considered "close" to the request will be
retrieved (the concept of closeness will be clearly defined
later). The number of items retrieved is usually controlled
by an input parameter, called the threshold, which is determined
either by the user or the system manager. The system
is an on-line system. The user can interact with the system
through some communication channel, such as a terminal.

To help analyse the whole retrieval process, the 
terms introduced above have to be more rigorously defined. 
Let n be the total number of attributes in the system. Each 
record in the data base is then generalized as an n-tuple
where the ith component represents the values for the ith 
attribute. A query is similarly defined. A value of 0 in 
the ith component indicates that the attribute is not related 
to the item represented by this record. The higher the value 
assigned to the ith component, the more important the ith 
attribute is considered to be to the item. In some cases,
however, a record (and query) might be simplified as a
binary vector. The similarity between an item and a request
is quantified by the similarity between the record and the
query representing them. Here we express the similarity
between two n-vectors $A=(a_1,\ldots,a_n)$ and $B=(b_1,\ldots,b_n)$ by
means of the simple matching function

$$f(A,B) = \sum_{i=1}^{n} a_i b_i .$$

If A and B represent a query and a record respectively, a
larger function value means that the record is closer to the
query and hence that the record has a better chance of being
retrieved. (In particular, if all the records and queries
in the system are binary vectors, f will take as its value
the number of attributes in common between the query and
the record.) The record will be retrieved if and only if
$f(A,B) > t$, where t is a pre-assigned threshold value.
Alternatively, the user can restrict the number of retrieved
items, say to 10, so that only the 10 items with the highest
function values are retrieved.
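
The two retrieval rules just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the matching function is taken to be the inner product of the two n-vectors (which, for binary vectors, counts the attributes in common), and the function names `simple_match`, `retrieve_by_threshold` and `retrieve_top_k` are hypothetical, not part of any system named in the text.

```python
from typing import List, Sequence


def simple_match(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two n-vectors (assumed form of the matching function)."""
    if len(a) != len(b):
        raise ValueError("query and record must have the same number of attributes")
    return sum(x * y for x, y in zip(a, b))


def retrieve_by_threshold(query: Sequence[float],
                          records: List[Sequence[float]],
                          t: float) -> List[int]:
    """Indices of all records whose matching value strictly exceeds t."""
    return [i for i, rec in enumerate(records) if simple_match(query, rec) > t]


def retrieve_top_k(query: Sequence[float],
                   records: List[Sequence[float]],
                   k: int) -> List[int]:
    """Indices of the k records with the highest matching values.

    Ties keep their original order because Python's sort is stable.
    """
    order = sorted(range(len(records)),
                   key=lambda i: simple_match(query, records[i]),
                   reverse=True)
    return order[:k]


if __name__ == "__main__":
    query = [1, 0, 1, 1]
    records = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 1, 1], [0, 0, 1, 1]]
    print(retrieve_by_threshold(query, records, t=1))
    print(retrieve_top_k(query, records, k=2))
```

The threshold rule returns a variable number of items, while the top-k rule fixes the number retrieved in advance; both rank by the same function value.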

Geometrically, the items and queries can be regarded 
as points (or vectors) in the n-dimensional space, the 
distance between the vectors being measured in some norm. 
With the threshold t as the radius and the query as the 
centre, all the items that fall within this sphere are 
retrieved. Alternatively, if only 10 retrieved items are
required, they will be the 10 closest neighbors of the
query.


1.3 System Effectiveness and Efficiency 


The best-match type of retrieval is not necessarily 
applicable to all information systems. Opposed to the idea 
of best-match is the exact-match, which retrieves all those,
and only those, records matching exactly with the query, i.e.,
containing all the attributes of the query. So, in those
systems using the exact-match method, all the items retrieved
will be pertinent to the user's needs. Apparently, this
matching method does not always satisfy all users. For
instance, in a library environment, the user of the system
usually has a very vague idea of what he actually needs.
He might want to find references on some subject, but does
not know the authors or titles. He can only roughly
describe the contents of the documents or books he needs 

by a few keywords, the choice of which is obviously a sub- 
jective one. It is therefore desirable to let the system 
determine which documents are most likely to be useful to 
him. Some advanced information systems like MEDLARS (of the
U.S. National Library of Medicine) and SMART (an experimental
system for researchers at Cornell University)
have adopted such a method.

Together with retrieval by content comes the problem
of relevance. For a system to be 100% effective,
all the items that are considered relevant to the request
must be retrieved and every item retrieved must be
relevant. This ideal situation is rarely achieved, simply
because the ultimate judgment on whether an item retrieved
is useful or not is made by the user who initiates the
request. This problem persists to some extent even in a
completely manual information system.


On the other hand, computerized information systems,
thanks to their earlier successes, are gaining in popularity.
Coupled with the fact that the cost of building
and using one is decreasing, the users are demanding ever
larger systems and their application is expanding into
new areas. To meet these challenges, new processes have
been devised, aiming at improving the system effectiveness
and/or efficiency. Here are a few examples of such processes.

(1) Manual indexing can no longer cope with the explosive
information growth. Researchers are now looking into
automatic text-processing methods, which will undoubtedly
improve the efficiency.
(2) If the data base is organized into different classes
according to their contents, it becomes possible to search
selectively some parts of the data base, thus saving a lot
of time.
(3) It can be beneficial for the user to communicate to
the system his relevance decisions on the items retrieved,
so that the system can utilize this feedback information
to retrieve more items that may be useful to him. In this
case, more computing time as well as the user's time will be
consumed, but the user will probably be more satisfied with
the retrieval result.

However, implementation of each of these processes
requires a tremendous amount of both human effort and financial
resources. There has to be some means of evaluating
these processes to see whether they are justified. For
example, can automatic indexing compete with manual
methods in terms of producing retrievals of equally high
quality? What is the risk of deteriorating retrieval
performance by ignoring other parts of the data base? Can
the feedback method really improve the system effectiveness?
Procedures should be established to provide answers to
questions like these.


1.4 System Evaluation 


The first systematic approach to system evaluation 
was adopted by the famous Aslib project in Cranfield, England,
in the early 1960's. There, experiments were conducted
to examine, among other things, different indexing strategies. 
The sample collections employed were documents on aero- 
dynamics and aircraft structure. The size of the collections 
ranged from 82 to 1400 documents per collection. The sample 
queries were submitted by aeronauticists and each document in 
the collection was also manually examined to determine its 
relevance to the query. The effectiveness of an indexing 
strategy was measured by recall R and precision P, defined
as:

$$R = \frac{\text{number of items retrieved and relevant}}{\text{total relevant in collection}}$$

and

$$P = \frac{\text{number of items retrieved and relevant}}{\text{total retrieved in collection}}.$$
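
As a small worked illustration of the two measures, the following Python sketch computes recall and precision from sets of item identifiers. The function name `recall_precision` and the convention of returning 0.0 when a denominator is empty are assumptions made here for the example only.

```python
from typing import Hashable, Iterable, Tuple


def recall_precision(retrieved: Iterable[Hashable],
                     relevant: Iterable[Hashable]) -> Tuple[float, float]:
    """Return (recall, precision) for one query.

    recall    = |retrieved and relevant| / |relevant in collection|
    precision = |retrieved and relevant| / |retrieved|
    An empty denominator yields 0.0 (a convention chosen for this sketch).
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    recall = hits / len(relevant_set) if relevant_set else 0.0
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    return recall, precision


if __name__ == "__main__":
    # Four items retrieved, five relevant in the collection, two in common.
    print(recall_precision({1, 2, 3, 4}, {2, 4, 6, 8, 10}))
```

With four items retrieved, five relevant items in the collection and two in common, recall is 2/5 and precision is 2/4.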
The SMART system [Salton 1968] greatly enhanced this method
and automated it. More sample data bases on different
subjects were added to the Cranfield collections and the
system was capable of testing many more processes. The
system is now available as a software package, serving as a testing
ground for system designers and researchers to evaluate
their newly devised methods as well as various input parameters
to the already known processes. The work carried
out by the information specialists at Arthur D. Little, Inc.
[Giuliano 1966] is similar in nature. This approach is
still being used extensively and is generally considered 
acceptable by the industry. Nevertheless, with so many 
input parameters usually associated with each process, 
there are no assurances that the values chosen will in fact 
be optimal or near optimal (in some sense), or indeed will 
work at all. Recently, research articles that are rather 
theoretical in nature have emerged in this area [Brookes 
1968, Swets 1969, Bookstein 1974, Yu 1975 etc.]. However, 
most of them are mainly concerned with building the models 
rather than making use of them for some specific processes. 
Others emphasize indexing strategies. This thesis is the 
first real attempt to bring these well-known processes 
(i.e., RF and RCF) and the models together. In doing so,
not only are the models put to use for more constructive
purposes, but also the models are tested for their shortcomings
and adequacy for mathematical manipulation.
CHAPTER 2


RELEVANCE FEEDBACK 


2.1 Motivation


It is generally conceded that there is plenty of 
room for new techniques that aim at improving the effective- 
ness of a computerized retrieval system. The computer is not 
an intelligent machine (no breakthrough is yet in sight in
the efforts of making it one) and man-machine communication
is far from being perfect. Two possible remedying strategies 
to increase the interaction between the user and the system 
can be adopted. The data base can be "tuned" regularly 
based on the users' response on the previous retrieval per- 
formance. This involves changing the index representation 
of the data base. Quite a number of methods have been 
proposed and analyses of selected methods are provided
[Yu 1976]. Alternatively, the user's query can be altered
by the system in an interactive environment. The user 
evaluates each of the retrieved items as either relevant or 
irrelevant and then sends the information back to the system. 
The system then formulates a new query by making use of such 
information. Hopefully, this new query will retrieve more 
relevant items and fewer irrelevant items. This process is 
called relevance feedback. This process has been designed
mainly for an on-line document retrieval system, where the
items are actually documents and the users
are searching for quick references on some specific topics.
Here, we are concerned more with the relevance of the
retrieved documents than the representation of the documents
in the system. Therefore, in the rest of this chapter, we
shall indiscriminately refer to an item or the record
representing it as a document.

A practical method for updating queries has been
suggested by Rocchio [1965]. The new query is given by

$$Q^* = Q + \alpha \sum_{D \in R} D - \beta \sum_{D' \in I} D' \qquad (2.1)$$

where $\alpha, \beta \ge 0$ are parameters, and $D$ and $D'$ range over
respectively the sets of relevant ($R$) and irrelevant ($I$) items
retrieved by Q. The formula has the effect of increasing
the influence of relevant documents (and hence, the attributes
contained in them) and decreasing the effect of the
irrelevant ones. Many experiments have been
conducted which show that this particular algorithm performs
reasonably well [Rocchio 1965, 1966]. The behavior of some
variants of the method, such as deleting one of the three terms
in the equation has also been observed [Ide 1968, Crawford 
1968]. However, very little theoretical justification has 
been given. 
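
The update rule (2.1) can be sketched directly on attribute vectors. The following Python fragment is a minimal illustration, assuming that queries and documents are plain lists of numeric attribute weights of equal length; the function name `rocchio_update` is hypothetical and no normalization or clipping of negative weights is applied, since none is described in the text.

```python
from typing import List, Sequence


def rocchio_update(query: Sequence[float],
                   relevant: List[Sequence[float]],
                   irrelevant: List[Sequence[float]],
                   alpha: float,
                   beta: float) -> List[float]:
    """Return Q* = Q + alpha * sum(relevant) - beta * sum(irrelevant).

    `relevant` and `irrelevant` are the retrieved documents judged by the
    user; alpha and beta are the non-negative parameters of (2.1).
    """
    if alpha < 0 or beta < 0:
        raise ValueError("alpha and beta must be non-negative")
    new_query = [float(w) for w in query]
    for doc in relevant:
        for i, w in enumerate(doc):
            new_query[i] += alpha * w      # pull towards relevant documents
    for doc in irrelevant:
        for i, w in enumerate(doc):
            new_query[i] -= beta * w       # push away from irrelevant ones
    return new_query


if __name__ == "__main__":
    q = [1, 0]
    print(rocchio_update(q, relevant=[[0, 1], [0, 1]], irrelevant=[[1, 0]],
                         alpha=0.5, beta=0.25))
```

In the two-attribute example the query moves towards the second attribute, which both relevant documents carry, and away from the first, which only the irrelevant document carries.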

Intuitively, the relevance feedback method should
improve the retrieval performance since the system has
obtained more information from the user about his requirements.
But it might fail if the relevant items are too
dispersed or the user's query is too ill-formulated (this
will happen if the user is too vague about what he actually
wants).

To explain this phenomenon in greater detail, let 
us consider a hypothetical system with only two attributes. 
In this way, each document can be adequately represented by 
a point in a plane (see Fig.2). Suppose that the system 
retrieves 4 documents in the first try, 2 relevant and 2
irrelevant. Both relevant documents retrieved are
located in the top-left of the retrieval circle while the
2 irrelevant ones are in the opposite position. Under the
effect of (2.1), the query will be shifted in the direction
of the relevant ones and away from the irrelevant ones. If
the relevant documents (and to a lesser extent, the irrelevant
ones) are "flocked" together (as in Fig. 2(a)), the new
query thus generated will produce better results.¹ However,
if the relevant documents are dispersed (see Fig. 2(b)),
the retrieval performance can deteriorate.

¹ It is interesting to note that when there are two or more
"flocks" of relevant documents retrieved, the query could as
well be split up into a number of new queries, maybe one for
each such "flock". There are usually relatively few relevant
documents retrieved each time (that is why the feedback
method is needed!), so it would be difficult to detect multiple
flocks. Nevertheless, experiments [Borodin 1968] have
been conducted to test the split query method, although this
method will not be dealt with here.

[Fig. 2(a): "Flocking" of relevant documents; first try and second try (improved).
Fig. 2(b): Dispersion of relevant documents; first try and second try (deteriorated).
Legend: relevant document, irrelevant document, original query Q, reformulated query Q*.]
The other possible cause of failure of the method is
that the system cannot distinguish the relevant documents
from the irrelevant ones. It occurs when the query is not
adequately prepared in the first place, such as containing
attributes with opposing meanings etc., or the data base is
not organized properly with respect to the particular
query in question. As a result, the set of relevant documents
as a whole is no longer "closer" to the query than
the set of irrelevant documents.

It is these arguments that motivated the analysis 
of relevance feedback. The arguments clearly indicate that 
some concepts have to be quantified in order that rigorous 
analysis can be carried out. Among them are the idea of 
documents "flocking together" and the distinction between 
relevant and irrelevant documents. In the next section, a 
probabilistic model will be constructed which will enable 
these concepts to be precisely defined. The arguments also 
pave the way along which the analysis can develop. In fact,
the analysis has succeeded in verifying these intuitive
arguments.


2.2 A Probabilistic Model


With respect to a query Q submitted by the user,
the document space is divided (at least implicitly in the
user's mind) into two disjoint subsets, namely the set $\mathcal{R}$ of
relevant documents and the set $\mathcal{I}$ of irrelevant documents.
Obviously, $R \subset \mathcal{R}$ and $I \subset \mathcal{I}$, where $R$ and $I$, the sets of
relevant and irrelevant documents retrieved by Q, are defined in Section
2.1. In the following, for each given query Q, we shall
define six classes of normal distributions over $\mathcal{R}$ and $\mathcal{I}$.

The first normal distribution is for the random
variable which is the inner product $f(Q,D)$ between Q and a
document D in the population set $\mathcal{R}$ of all relevant documents. The normal distribution
is the relative frequency of documents D which assume the
value $f(Q,D)$. The expected value and standard deviation of
this distribution are assumed to be $\mu_1$ and $\sigma_1$ respectively.
Similarly, we define the normal distribution for the variable
$f(Q,D')$ over the population set $\mathcal{I}$ of all irrelevant documents, with expected value
$\mu_2$ and standard deviation $\sigma_2$. (The distributions presented
so far are those of Swets [Swets 1963, Brookes 1968].) Next,
for each retrieved relevant document $D_1 \in R$, we can define a
normal distribution for the variable $f(D_1,D_2)$, $D_2 \in \mathcal{R}$. We
assume that all of these distributions have the same expected
value $\mu_3$ and standard deviation $\sigma_3$ (i.e. they do not depend
on the choice of $D_1$).¹ Similarly, we define the other three
classes of normal distributions for $f(D_1',D_2)$, $f(D_1',D_2')$ and
$f(D_1,D_2')$. These distributions are summarised in the table
below.

¹ For later development, it is sufficient that the random
variable $\sum_{D_1 \in R} f(D_1,D_2)$, $D_2 \in \mathcal{R}$, be normally distributed
with mean $r\mu_3$ and standard deviation $\sqrt{r}\,\sigma_3$, where r is the
number of documents in $R$. Similar remarks apply
to the next three distributions. However, for ease of
presentation, we choose the approach as presented here.
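                                                                              Standard
                          Variable          Population                Mean    Deviation

For each Q,               $f(Q,D)$          $D \in \mathcal{R}$       $\mu_1$   $\sigma_1$   (see Figure 3(a))
For each Q,               $f(Q,D')$         $D' \in \mathcal{I}$      $\mu_2$   $\sigma_2$   (see Figure 3(a))
For each $D_1 \in R$,     $f(D_1,D_2)$      $D_2 \in \mathcal{R}$     $\mu_3$   $\sigma_3$
For each $D_1' \in I$,    $f(D_1',D_2)$     $D_2 \in \mathcal{R}$     $\mu_4$   $\sigma_4$
For each $D_1' \in I$,    $f(D_1',D_2')$    $D_2' \in \mathcal{I}$    $\mu_5$   $\sigma_5$
For each $D_1 \in R$,     $f(D_1,D_2')$     $D_2' \in \mathcal{I}$    $\mu_6$   $\sigma_6$

Here $R$ and $I$ are the sets of relevant and irrelevant documents retrieved
by Q, and $\mathcal{R}$ and $\mathcal{I}$ are the sets of all relevant and all irrelevant documents.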

The last four density functions can be obtained from 
the first two by proper substitutions. 

As in the Swets model [Swets 1963, Brookes 1968], we
have made two rather strict assumptions in the above discussion,
namely, that the distributions are continuous and are
normal. The distributions may be approximated by continuous
curves if the collection size is very large. By the Central
Limit Theorem [Feller 1967], it may be argued that the distributions
are normal. These assumptions have recently been
questioned by some authors [Heine 1974, Bookstein 1974].
Specifically, Heine [1974] doubts that the distributions
are normal. No extensive experiments have been performed
to validate or falsify the assumption. However, Heine
admits that "simulation studies carried out indicate that
the assumption is not seriously in error." Moreover, the
experimental results by Swets [1969] and the explanation by
Brookes indicate that "the probability density functions
are Gaussian (normal) or very nearly so."


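Based on the above definitions, the total numbers of
relevant and irrelevant documents retrieved by Q at threshold
value T are

$$\frac{k_1}{\sqrt{2\pi}\,\sigma_1} \int_T^{\infty} \exp\left(-\frac{1}{2}\left(\frac{x-\mu_1}{\sigma_1}\right)^2\right) dx
\quad\text{and}\quad
\frac{k_2}{\sqrt{2\pi}\,\sigma_2} \int_T^{\infty} \exp\left(-\frac{1}{2}\left(\frac{y-\mu_2}{\sigma_2}\right)^2\right) dy$$

respectively, where $k_1$ is the number of relevant documents in the
collection and $k_2$ is the number of irrelevant documents in the collection.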
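
The two integrals are upper tails of normal distributions, so they can be evaluated with the complementary error function. The Python sketch below is an illustration only; the function names `normal_tail` and `expected_retrieved` and the numerical values in the example are assumptions chosen for this illustration, not data from the experiments.

```python
import math
from typing import Tuple


def normal_tail(threshold: float, mu: float, sigma: float) -> float:
    """P(X > threshold) for X normally distributed with mean mu, std sigma."""
    return 0.5 * math.erfc((threshold - mu) / (sigma * math.sqrt(2.0)))


def expected_retrieved(T: float,
                       k1: float, mu1: float, sigma1: float,
                       k2: float, mu2: float, sigma2: float) -> Tuple[float, float]:
    """Expected numbers of (relevant, irrelevant) documents retrieved at T.

    k1, k2 are the sizes of the relevant and irrelevant populations;
    (mu1, sigma1) and (mu2, sigma2) describe the distributions of the
    matching value between the query and each population.
    """
    return k1 * normal_tail(T, mu1, sigma1), k2 * normal_tail(T, mu2, sigma2)


if __name__ == "__main__":
    # Hypothetical collection: 100 relevant and 1000 irrelevant documents.
    rel, irr = expected_retrieved(5.0, 100, 5.0, 1.0, 1000, 3.0, 1.0)
    print(round(rel, 2), round(irr, 2))
```

Setting the threshold at the mean of the relevant distribution retrieves half of the relevant documents, while the irrelevant population contributes only its tail beyond two standard deviations.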
If the set of index terms representing the documents
is chosen appropriately, the set of relevant documents
should be close together while the irrelevant documents are
dispersed in the space. Mathematically, this means that
$\mu_3 > \mu_6$. Let $D_2$ be a relevant document relative to Q and $D_1'$
be an irrelevant document retrieved by Q. $D_2$ may or may
not be retrieved by Q. In the former case, Q and $D_2$ have
quite a few terms in common. Similarly, there are many
terms in common between $D_1'$ and Q. The fact that $D_2$ and $D_1'$
are classified in different categories with respect to Q
makes it extremely unlikely that $D_1'$ has a high correlation
with $D_2$, assuming that the relevance assessment is correct.
(If two documents are close to a given query, it is not
necessarily true that they are close to one another.)
If $D_2$ is not retrieved by Q, then there is practically no
relation between $D_1'$ and $D_2$. Hence, in either case, the
correlation of $D_1'$ with a "random" document $D_2 \in \mathcal{R}$ is about
the same as that of $D_1'$ with an arbitrary random document.
Thus, the average value of $f(D_1',D_2)$, $D_2 \in \mathcal{R}$, could be the same
as that of $f(D_1',D_2')$, $D_2' \in \mathcal{I} - \{D_1'\}$. Since $D_1'$ belongs to $\mathcal{I}$,
the situation $D_2' = D_1'$ in the distribution of $f(D_1',D_2')$, $D_2' \in \mathcal{I}$
(please refer to the fifth distribution of the Table) must
occur. Thus, the average value of $f(D_1',D_2')$, $D_2' \in \mathcal{I}$, is slightly
greater than that of $f(D_1',D_2')$, $D_2' \in \mathcal{I} - \{D_1'\}$. On the other hand,
the situation $D_2 = D_1'$ in the distribution of $f(D_1',D_2)$, $D_2 \in \mathcal{R}$
(please refer to the fourth distribution of the Table) can
never arise. As a consequence, the average value of
$f(D_1',D_2')$, $D_2' \in \mathcal{I}$, is greater than that of $f(D_1',D_2)$, $D_2 \in \mathcal{R}$ (i.e.
$\mu_5 > \mu_4$), though it is expected that the difference is
really very small. If the query Q is properly formulated,
we may expect that it is on the average closer to the
relevant documents than to the irrelevant ones. It follows
that $\mu_1 > \mu_2$. Hence, throughout this paper, it is assumed that
$\mu_1 > \mu_2$, $\mu_3 > \mu_6$ and $\mu_5 > \mu_4$.
Taking the inner product of a relevant document
D (a member of the set of all relevant documents) with both sides of (2.1), we get

$$f(Q^*,D) = f(Q,D) + \alpha \sum_{D_i \in R} f(D_i,D) - \beta \sum_{D_i' \in I} f(D_i',D). \qquad (2.2)$$
Hence, assuming the mutual independence of the variables on
the right side of (2.2), the expected value $\mu_1^*$ and standard
deviation $\sigma_1^*$ of $f(Q^*,D)$, for relevant documents D, can be shown to be

$$\mu_1^* = \mu_1 + \alpha r \mu_3 - \beta s \mu_4
\quad\text{and}\quad
\sigma_1^* = (\sigma_1^2 + \alpha^2 r \sigma_3^2 + \beta^2 s \sigma_4^2)^{1/2} \qquad (2.3)$$

respectively, where r and s are the numbers of documents of
$R$ and $I$ respectively. According to probability theory
[Feller 1967], $f(Q^*,D)$ is also normally distributed.

Similarly, the variable $f(Q^*,D')$, for irrelevant documents D', is normally
distributed with expected value $\mu_2^*$ and standard deviation $\sigma_2^*$,
where

$$\mu_2^* = \mu_2 + \alpha r \mu_6 - \beta s \mu_5
\quad\text{and}\quad
\sigma_2^* = (\sigma_2^2 + \alpha^2 r \sigma_6^2 + \beta^2 s \sigma_5^2)^{1/2}. \qquad (2.4)$$


The above relations can be illustrated by Figures 3(a)
and 3(b). Figure 3(a) shows the distribution of the relevant
and irrelevant documents with respect to the original query
Q. It is seen that a relevant document is closer to Q than
an irrelevant document by an average "distance" of $(\mu_1-\mu_2)$.
Clearly, if the dispersions (standard deviations) of the
relevant and irrelevant documents remain unchanged while
the separation of the relevant documents from the irrelevant
documents increases, a better retrieval result is expected.
Figure 3(b) shows the distribution of the documents with
respect to the new query Q*. The new "distance" between the
relevant documents and irrelevant documents has been increased
to $(\mu_1^*-\mu_2^*) = (\mu_1-\mu_2) + \alpha r(\mu_3-\mu_6) + \beta s(\mu_5-\mu_4)$. Unfortunately,
the dispersions of the documents may have increased too. It
is not clear from these two figures that Q* performs better
than Q.
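
The effect of feedback on the two distributions can be computed numerically. The following Python sketch is an illustration under stated assumptions: the means follow the relations for the new query given in the text, the standard deviation for the relevant population is taken to have the same form as the one for the irrelevant population in (2.4), and the function name `feedback_moments` and all numerical values are hypothetical.

```python
import math
from typing import Dict, Tuple


def feedback_moments(mu: Dict[int, float], sigma: Dict[int, float],
                     alpha: float, beta: float,
                     r: int, s: int) -> Tuple[float, float, float, float]:
    """Return (mu1*, sigma1*, mu2*, sigma2*) for the modified query Q*.

    mu and sigma are keyed 1..6 as in the table of distributions;
    r and s are the numbers of retrieved relevant and irrelevant documents.
    Independence of the summed variables is assumed, as in the text.
    """
    mu1_new = mu[1] + alpha * r * mu[3] - beta * s * mu[4]
    mu2_new = mu[2] + alpha * r * mu[6] - beta * s * mu[5]
    sigma1_new = math.sqrt(sigma[1] ** 2 + alpha ** 2 * r * sigma[3] ** 2
                           + beta ** 2 * s * sigma[4] ** 2)
    sigma2_new = math.sqrt(sigma[2] ** 2 + alpha ** 2 * r * sigma[6] ** 2
                           + beta ** 2 * s * sigma[5] ** 2)
    return mu1_new, sigma1_new, mu2_new, sigma2_new


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to exercise the formulas.
    mu = {1: 5.0, 2: 3.0, 3: 4.0, 4: 2.0, 5: 2.5, 6: 2.0}
    sigma = {i: 1.0 for i in range(1, 7)}
    m1, s1, m2, s2 = feedback_moments(mu, sigma, alpha=0.1, beta=0.05, r=4, s=6)
    print("separation before:", mu[1] - mu[2], "after:", m1 - m2)
    print("dispersions after:", s1, s2)
```

In this example the separation grows from 2 to 2.95 while both standard deviations grow from 1 to about 1.027, which shows why the comparison of Q and Q* needs the more careful analysis that follows.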


Thus, we have introduced a probabilistic model for the
relevance feedback process. In the next section, based on this
model, we shall compare the
retrieval performances of Q and Q*. In particular, some
necessary and sufficient conditions on $\alpha$ and $\beta$ will be
derived such that Q* is better than Q. A region in the
$(\alpha,\beta)$-plane is found; this region lengthens
the "distance" between the relevant documents and the
irrelevant documents and makes sure that the irrelevant
documents are no more important than the relevant documents
in formulating the modified query (Corollary 2.2, Condition 4).


2.3 Some Necessary and Sufficient Conditions 
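Let T* be the threshold when Q* is used to retrieve
documents. For a fair comparison between Q and Q* in retrieval
performance, the same number of documents should be
retrieved by both of them. Thus the following relation
holds:

$$\frac{k_1}{\sqrt{2\pi}\,\sigma_1} \int_T^{\infty} e^{-\frac{1}{2}\left(\frac{x-\mu_1}{\sigma_1}\right)^2} dx
+ \frac{k_2}{\sqrt{2\pi}\,\sigma_2} \int_T^{\infty} e^{-\frac{1}{2}\left(\frac{y-\mu_2}{\sigma_2}\right)^2} dy
= \frac{k_1}{\sqrt{2\pi}\,\sigma_1^*} \int_{T^*}^{\infty} e^{-\frac{1}{2}\left(\frac{x-\mu_1^*}{\sigma_1^*}\right)^2} dx
+ \frac{k_2}{\sqrt{2\pi}\,\sigma_2^*} \int_{T^*}^{\infty} e^{-\frac{1}{2}\left(\frac{y-\mu_2^*}{\sigma_2^*}\right)^2} dy. \qquad (2.5)$$

This relation defines T* as a continuous function of $\alpha$ and $\beta$. Moreover, we have

$$\lim_{\alpha \to 0,\ \beta \to 0} T^* = T. \qquad (2.6)$$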

Definition. The retrieval performance of Q* is said to be
as good as Q, or notationally $Q^* \ge Q$, if and only if Q*
retrieves at least as many relevant documents as Q while the
total number of documents retrieved by both queries is the
same. If Q* retrieves more relevant documents than Q
while the same number of documents are retrieved, then we
write $Q^* > Q$.
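
The definition can be checked numerically within the model: choose the new threshold T* so that both queries retrieve the same expected total, then compare the expected numbers of relevant documents. The Python sketch below does this by bisection, which works because the expected total is strictly decreasing in the threshold. All function names and numerical values are assumptions made for this illustration, and the target total is assumed to lie strictly between 0 and the collection size.

```python
import math
from typing import Tuple

Params = Tuple[float, float, float, float]  # (mu1, sigma1, mu2, sigma2)


def normal_tail(threshold: float, mu: float, sigma: float) -> float:
    """P(X > threshold) for a normal variable with mean mu and std sigma."""
    return 0.5 * math.erfc((threshold - mu) / (sigma * math.sqrt(2.0)))


def total_retrieved(threshold: float, k1: float, k2: float, p: Params) -> float:
    """Expected total number of documents retrieved at the given threshold."""
    return (k1 * normal_tail(threshold, p[0], p[1])
            + k2 * normal_tail(threshold, p[2], p[3]))


def matching_threshold(target: float, k1: float, k2: float, p_new: Params,
                       iterations: int = 200) -> float:
    """Bisection for T* such that Q* retrieves `target` documents in total."""
    spread = 40.0 * max(p_new[1], p_new[3])
    low = min(p_new[0], p_new[2]) - spread    # retrieves (almost) everything
    high = max(p_new[0], p_new[2]) + spread   # retrieves (almost) nothing
    for _ in range(iterations):
        mid = 0.5 * (low + high)
        if total_retrieved(mid, k1, k2, p_new) > target:
            low = mid       # too many retrieved: raise the threshold
        else:
            high = mid
    return 0.5 * (low + high)


def compare_queries(T: float, k1: float, k2: float,
                    p_old: Params, p_new: Params) -> Tuple[float, float, float]:
    """Return (T*, relevant retrieved by Q, relevant retrieved by Q*)."""
    target = total_retrieved(T, k1, k2, p_old)
    t_star = matching_threshold(target, k1, k2, p_new)
    return (t_star,
            k1 * normal_tail(T, p_old[0], p_old[1]),
            k1 * normal_tail(t_star, p_new[0], p_new[1]))


if __name__ == "__main__":
    # Hypothetical case: feedback raises the relevant mean from 5 to 6.
    t_star, rel_q, rel_q_star = compare_queries(
        5.0, 100, 1000, (5.0, 1.0, 3.0, 1.0), (6.0, 1.0, 3.0, 1.0))
    print(t_star, rel_q, rel_q_star, rel_q_star >= rel_q)
```

In the hypothetical case shown, the new threshold falls between 5 and 6 and Q* retrieves more relevant documents than Q for the same total, so Q* is better than Q in the sense of the definition.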

The following theorem states a necessary and sufficient
condition for $Q^* \ge Q$.
We first prove the necessary condition. Suppose
$Q^* \ge Q$, then $(T^*-\mu_1^*)/\sigma_1^* \le (T-\mu_1)/\sigma_1$. By (2.8) we have
$(T^*-\mu_2^*)/\sigma_2^* \ge (T-\mu_2)/\sigma_2$.
Hence $T^* \le (T-\mu_1)\sigma_1^*/\sigma_1 + \mu_1^*$ and $T^* \ge (T-\mu_2)\sigma_2^*/\sigma_2 + \mu_2^*$. The last
two inequalities imply (2.7). Similarly, the sufficient
condition can be proved by contradiction.

Although (2.7) is a necessary and sufficient
condition for $Q^* \ge Q$, it is difficult to visualize a region
in the $(\alpha,\beta)$-plane which satisfies (2.7). In order to facilitate
further discussion on this aspect, let us first
parametrize $\alpha$ and $\beta$ by two new variables m and t such that

$$\alpha r = mt \quad\text{and}\quad \beta s = t, \qquad m \ge 0,\ t \ge 0. \qquad (2.9)$$


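With the new parameters, (2.1), (2.3) and (2.4) can
be reexpressed as

$$Q^* = Q + \frac{mt}{r} \sum_{D_i \in R} D_i - \frac{t}{s} \sum_{D_i' \in I} D_i', \qquad (2.10)$$

$$\mu_1^* = \mu_1 + mt\mu_3 - t\mu_4, \qquad \sigma_1^* = \left(\sigma_1^2 + t^2(m^2\sigma_3^2/r + \sigma_4^2/s)\right)^{1/2} \quad\text{and}$$

$$\mu_2^* = \mu_2 + mt\mu_6 - t\mu_5, \qquad \sigma_2^* = \left(\sigma_2^2 + t^2(m^2\sigma_6^2/r + \sigma_5^2/s)\right)^{1/2}. \qquad (2.11)$$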

We now elaborate on the geometry of the $(\alpha,\beta)$-plane,
as applied in this paper. The plane is meant to be the
positive quadrant, i.e. $\alpha \ge 0$, $\beta \ge 0$. A point in the plane is
either represented by the rectangular co-ordinates $(\alpha,\beta)$ or
by the parameters $(m,t)$. To avoid the difficulties in representing
points near the $\alpha$-axis in the latter form, (2.9) can
be rewritten as
Ce oy = Cee eS O (2.9a)


There are two forms of lines used in this paper: one
is characterized by a fixed $\alpha$ (or $\beta$) and the other by a
fixed m. The former is a line parallel to the $\beta$-axis (or
the $\alpha$-axis) and the latter is a line through the origin with
slope $r/(ms)$.¹ For our future discussion, a region in the
plane is either a rectangle bounded by four lines parallel
to the $\alpha$- and the $\beta$-axis or a sector bounded by two lines
through the origin.
According to (2.1), a modified query Q* is uniquely
determined by the values of $\alpha$ and $\beta$, given the original
query Q. Hence, there is a one-to-one correspondence
between the (a,8)-plane and the set of all modified queries 
defined by (2.1). Any geometrical entities (e.g. points, 
lines, regions) are a subset of the (a,8) plane and hence 
can be interpreted as a subset of the modified queries. 

The following corollary gives the minimal possible
value of m such that the whole half-line defined by (2.9)
satisfies (2.7).


¹ The $\alpha$-axis and the $\beta$-axis are represented by $m = \infty$ and $m = 0$
respectively.

Corollary 2.2. Let the number of relevant documents and
the number of irrelevant documents retrieved by Q be non-zero,
i.e. $r \ne 0$ and $s \ne 0$. Let Q* be obtained by (2.10) and
have standard deviations $\sigma_1^*$ and $\sigma_2^*$ as given by (2.11). If
the following hypotheses are true, then $Q^* \ge Q$.

(1) The number of documents retrieved by Q at the
threshold T is no more than half of the total
number of documents.


* * 
(2) 01/9, > J5/T5- 
(30) or OTS a are te yar se 


2 a Dae 
(4) m= a aa ea ec 


DEP ELD 
(H37H,) 057-93 (b,-H) iin 


2) 
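Proof. By (2.11) and the inequality $(1+x^2)^{1/2} \le 1 + x$ for
$x \ge 0$, we have (by letting $x = t\,(m^2\sigma_3^2/r + \sigma_4^2/s)^{1/2}/\sigma_1$)

$$
\begin{aligned}
&[(T-\mu_1)\sigma_1^*/\sigma_1 + \mu_1^*] - [(T-\mu_2)\sigma_2^*/\sigma_2 + \mu_2^*] \\
&\quad = (\mu_1^*-\mu_2^*) - (\mu_1-\mu_2)\sigma_1^*/\sigma_1 + (T-\mu_2)(\sigma_1^*/\sigma_1 - \sigma_2^*/\sigma_2) \\
&\quad = (\mu_1-\mu_2) + mt(\mu_3-\mu_6) + t(\mu_5-\mu_4) - (\mu_1-\mu_2)\left[1 + m^2t^2\sigma_3^2/(r\sigma_1^2) + t^2\sigma_4^2/(s\sigma_1^2)\right]^{1/2} \\
&\qquad + \{(T-\mu_2)(\sigma_1^*/\sigma_1 - \sigma_2^*/\sigma_2)\} \\
&\quad \ge t\left[m(\mu_3-\mu_6) + (\mu_5-\mu_4) - (\mu_1-\mu_2)\left(m^2\sigma_3^2/(r\sigma_1^2) + \sigma_4^2/(s\sigma_1^2)\right)^{1/2}\right] \\
&\qquad + \{(T-\mu_2)(\sigma_1^*/\sigma_1 - \sigma_2^*/\sigma_2)\}. \qquad (2.12)
\end{aligned}
$$

By Hypotheses (3) and (4), we get
$\sigma_1^2[m(\mu_3-\mu_6) + (\mu_5-\mu_4)]^2 \ge (\mu_1-\mu_2)^2(m^2\sigma_3^2/r + \sigma_4^2/s)$.
Taking square roots on both sides shows that
the first bracketed expression of (2.12) is non-negative.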
Next, we claim that $T \ge \mu_2$. Otherwise we would have
$T < \mu_2 < \mu_1$. Then, the lower limits of the two integrals
on the right-hand side of (2.8) have negative values,
implying that Q retrieves more than half of the total
number of documents, contrary to Hypothesis (1). Hence, by
Hypothesis (2), the second
bracketed expression of (2.12) is non-negative.

It follows from Theorem 2.1 that $Q^* \ge Q$.


Remarks. This theorem suggests, within the framework of 

this model, a rather practical way to improve retrieval 
results,provided that the four hypotheses are true. Hypo- 
thesis (4) specifies the sector in the (a,8) plane whose 
points define better queries in retrieval performance. (The 
sector is bounded by two lines. One is the q-axis. The 
other is a line through the origin whose "Slope" m satisfies 
hypothesis (4)). Now, we shall attempt to examine to what 
extent hypotheses (1) - (3) are realistic. 

Usually, a query retrieves only a small portion of
the documents and thus hypothesis (1) probably holds under
a normal retrieval environment. Owing to the lack of information
concerning the standard deviations, it is harder to
justify hypotheses (2) and (3). If the collection of documents
is properly indexed, it is expected that the average
closeness of two relevant documents relative to a given query
is a lot larger than that of a relevant document with an


irrelevant document. Thus, HW, >> Wes For a guery which 


requires the feedback operation, we may argue that Wy can- 
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not be much larger than Uo. Since a document usually has 


more terms than a query, the assumption that UH. is not less 


3 


than My is justified. As a consequence, geht. 2a os Se 


Thus, a case in which hypotheses (2) and (3) hold is that 

Oo; eer tee S GL It is obvious that -the above process can 

be iterated to produce better and better queries, i.e. if Q*
is obtained from Q by (α_1,β_1), then Q** can be obtained from
Q* by (α_2,β_2) with Q** ≥ Q* ≥ Q and so on. Since μ_1* ≠ μ_1 and
μ_2* ≠ μ_2, it may be necessary to choose (α_2,β_2) such that α_2 ≠ α_1
or β_2 ≠ β_1.

Sometimes the original query Q is formulated so badly
by the user that not even a single relevant document is
retrieved. Under such a condition, no value of α would help
improve the retrieval performance. The following corollary
states that if a small value for β is specified in (2.1),
then Q* ≥ Q.


Corollary 2.3. Let the number of relevant documents retrieved
by Q be zero, i.e. r = 0, and Q* be obtained by (2.1). If
the following hypotheses are true, then Q* ≥ Q.

(1) The number of documents retrieved by Q at the
threshold T is no more than half of the total number of
documents.

(2) 0,0, 2 0,0, and g,(i,-u,)>¥s oe) 


(3) o@ is arbitrary and 
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The last few corollaries are based on a number of 
assumptions about the expected values and standard devi- 
ations of the normal distributions. Our next result does 
not depend on these assumptions. 

Theorem 2.4. Let the number of relevant documents and the
number of irrelevant documents retrieved by Q be non-zero.
Let Q* be obtained from Q by (2.1). If α ≥ 0 and β ≥ 0 are
sufficiently small, then Q* ≥ Q.

Proof. Let any point (α,β) in the positive quadrant of the
(α,β)-plane be parametrized by (2.9). We first consider the
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iE io exp (-y~/2) dy + k ie Bs (ee 
1 gy 2g 
(2edes) 
ee) 2 ae 2 
= a exp (-y /2) dy + Ko expieVve / 2) oY 
(T-u,) 70, (T-u,)/9, 
where g_i(m,t) = (T*(m,t) − μ_i*)/σ_i*, i = 1,2, and μ_1*, μ_2*, σ_1*,
σ_2* are defined as in (2.11).
The number of relevant documents retrieved by Q*
at the threshold T*(m,t) is

\[ I(m,t) = \frac{k_1}{\sqrt{2\pi}} \int_{g_1(m,t)}^{\infty} \exp(-y^2/2)\,dy. \tag{2.14} \]
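As an illustration only, the integral in the expression for I(m,t) is an upper tail of the standard normal distribution and can be evaluated with the complementary error function. The sketch below assumes that k_1 stands for the total number of relevant documents and that mu and sigma are the mean and standard deviation of their correlation values; the function names are hypothetical and not part of the thesis.

```python
import math


def upper_tail(g):
    """Return (1/sqrt(2*pi)) * integral from g to infinity of exp(-y^2/2) dy."""
    return 0.5 * math.erfc(g / math.sqrt(2.0))


def relevant_retrieved(k1, threshold, mu, sigma):
    """Expected number of relevant documents whose correlation exceeds the threshold.

    Assumption: k1 is the number of relevant documents, and their correlation
    values are normally distributed with mean mu and standard deviation sigma.
    """
    g = (threshold - mu) / sigma  # standardized lower limit of the integral
    return k1 * upper_tail(g)
```

With the threshold placed exactly at the mean, half of the relevant documents are expected to be retrieved; raising the threshold lowers the count.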


The expression of ∂T*(m,t)/∂t can be obtained by
differentiating equation (2.13) with respect to t. Substituting
this expression into ∂I/∂t, we get
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((&5k6,6,)/ ban (05)? (05) 7 (G,/o,4+6,/0,)1} 


{-t7 (nyu) (no? /x + 02/8) (mo5/r + o4/s) 


Meee Zoe 2 2 Zaee 2 
ck [o5 (ue mu.) (1m J3/X + o/s) + Oo; (mu ,-u,) (m O_/r “ of/s)] 


* 


- t{(T (m,t)-u,) (m°o2/r + of/s) 04 


ae (B" (m,t)-p,) (n®oS/x + 


04/8) 05) 


ne 
+ o,05[(ue-uy) + m(w ue) ]t 
one oe 
1 ita Hil) att 5) 
Where G. = exp(- a - yas) tees Ad RP 
one 
i 


By (2.6), T*(m,t) → T as t → 0. It is also obvious that for
a fixed value of m, G_i and σ_i*, i=1,2, are positive and are
bounded as t tends to 0. Thus, ∂I(m,t)/∂t > 0 for sufficiently
small values of t. It then follows that Q* ≥ Q for
sufficiently small non-negative values of α and β (with at
least one of them strictly positive) in the sector defined
by α ≥ 0 and αr ≤ βs.

For the case where m>1, we can set n = 1/m and parametrize
(α,β) by αr = t, βs = nt instead of (2.9). Proceeding
similarly as above, we can then show that Q* ≥ Q for sufficiently
small non-negative values of α and β in the sector
defined by β ≥ 0 and αr ≥ βs.


2.4 Optimal Values for α and β


We now attempt to find the point (α,β) in the (α,β)
plane which maximizes the performance of the new query Q*.
Although we do not succeed in getting a closed form formula
for the point, it is found that the point must lie on a hyperbolic
curve. We shall first show that the point must lie in a
definite region of the plane.

Using the notation of the last section, it will be
shown that for sufficiently large t, I(m,t) is decreasing
for any value of m such that ∞ > m ≥ 0. Furthermore, the
optimal point cannot lie on the α axis or the β axis.
Three technical lemmas 2.5-2.7 lead to the results stated
in Theorem 2.8. For their proofs, please refer to Appendix 1.
Lemma 2.5. There exists a constant c_1 > 0 and a constant
t_1 such that for every t ≥ t_1 and every 0 < m ≤ 1, the threshold
T*(m,t) ≤ c_1 t.


Lemma 2.6. There exists a constant te such thatelor every 


ile 


tat, Zien miners WES ah achat Guianes) <0) 
Instead of expressing I in terms of m and t, we may
write I as a function of α and β by means of (2.9) and (2.14):

\[ I(\alpha,\beta) = \frac{k_1}{\sqrt{2\pi}} \int_{g_1(\alpha,\beta)}^{\infty} \exp(-y^2/2)\,dy. \]


Wenn ee eo boo. (0,6) 20) and an /oe (o,.0)>0 whem W,.s-0l: 

Theorem 2.8. Let the number of relevant documents and the
number of irrelevant documents retrieved by the original
query Q be r and s respectively, with r, s > 0. Let (α_1,β_1) =
(t_2/r, t_2/s), where t_2 is determined in Lemma 2.6. Then there
exists a point (α_2,β_2) with 0 < α_2 < α_1 and 0 < β_2 < β_1 such that the
Q* defined by the parameter (α_2,β_2) is at least as good as
the Q* defined by any (α,β), α,β ≥ 0. Furthermore, the point
(α_2,β_2) must lie on the curve


22 2 
a (0604-0505) (We-ug) +op (Oo 
=" (ab) (0% 


2 A Ze? 
(2-16) 


Proof. Let W be the closed region bounded by the lines

L_1: α = 0, 0 ≤ β ≤ β_1,
L_2: β = 0, 0 ≤ α ≤ α_1,
L_3: α = α_1, 0 ≤ β ≤ β_1,
and L_4: β = β_1, 0 ≤ α ≤ α_1.

Let the maximum of I(α,β) in W occur at (α_2,β_2). We now
show that I(α_2,β_2) ≥ I(α,β) for any α ≥ 0, β ≥ 0 and that (α_2,β_2) lies
in the interior of W.


Let (α_3,β_3) be any point in the quadrant α ≥ 0, β ≥ 0
but be outside of W. Then α_3 > α_1 or β_3 > β_1. Let m_3 = (α_3 r)/
(β_3 s) and L_5 be the line connecting (α_3,β_3) and the origin.
L_5 intersects either L_3 or L_4. Without loss of generality,
suppose L_5 intersects L_4 at (α_4,β_4). L_5 can be parametrized
by α = m_3 t/r and β = t/s where 0 ≤ t ≤ β_3 s. In particular,
the portion of the line from (α_4,β_4) to (α_3,β_3) is
defined by t_2 ≤ t ≤ β_3 s. By Lemma 2.6, I(m_3,t) is decreasing
in this interval. Thus, I(m_3,β_3 s) < I(m_3,t_2), i.e.,
I(α_3,β_3) < I(α_4,β_4). Since I(α_4,β_4) ≤ I(α_2,β_2), it follows
that I(α_3,β_3) < I(α_2,β_2).
By Tiemmat2.7 but es (obvious) Ela tae (a po) + Cannot 


Acai eee aeMax tL mum Of Ly or Lo. To show that (a5,85) cannot 


Dee Ong: consider any point (my,t,) on Ly where m,>0. 
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there must be a neighborhood of (my,t.) Within which dr (m,t)}/ 


dt is negative. Thus there exists te<t. SUC Viet, obit, ©) 


dt is negati < = 
gative on t, t<t., M=m,. Hence I(m,,t,) < I(m,,t), 


<a = 
te e te. IBUE 1M m 


at te<t<t, are interior points of W. 
Thus (α_2,β_2) cannot lie on L_4. Similarly, (α_2,β_2) cannot
lie on L_3. As a consequence, (α_2,β_2) satisfies ∂I/∂α(α,β) =
∂I/∂β(α,β) = 0. From ∂I/∂α(α,β) = 0, we obtain


* 
1 CCR {1/a(oé ( ee Veh Gane) (oy (an 
Foe 2 2 2 
a (uy03 (05) 1996 (04) )}. (2.17) 
SimLlarty from 1/33-.0,8).=..0,Wwe sand 


* 
T (0,8) = {1/B(o2 (03) *-04 (05) °) }-{(ug-u,) (0,)2(05)7 = 


* 2 aS DP x 2 xD 
B(W, 9, (95)" - W505 (0,) yey (2.18) 
Equating (2.17) and (2.18), (2.16) follows.
A numerical solution for (α_2,β_2) can be obtained by
substituting the relation between α_2 and β_2 and equation


Ce Oyen cO-equation. (230. 


2.5 Experimental Results 


Experiments are performed on a collection of 200
documents on aerodynamics (CRN2NULDOCS 200). The 42 queries
(CRN2NUL Quests 42) are used to retrieve the documents by
means of the simple matching function defined in Section
2.1. For each query, the threshold is set such that the
first ten documents which correlate highest with the query
are retrieved. Of the 42 queries, it is found that 8
queries do not retrieve any relevant document. The first
12 queries, each retrieving at least one relevant document,
are chosen. A number of (α,β) values are selected according
to corollary 2.2, i.e. the new queries defined by the (α,β)'s
should be at least as good as the original queries. In
the absence of information about the standard deviations
and the expected values of the random variables defined in
section 1.1, we arbitrarily divide the α-β plane into two
parts by the line αr = βs. The (α,β) values tested
satisfy αr > βs. The exact (α,β) values are shown in figure
4(a). Each point in the diagram is represented by a number
from 1 to 5. The average performance of the new queries
defined by the different (α,β) values, relative to the original
queries, is shown in figure 4(b). The x-axis of the figure
is the recall value averaged over the twelve queries, where
recall is the proportion of relevant documents retrieved.
The y-axis is the averaged precision value, where precision
is the proportion of retrieved documents that are relevant.
(A detailed discussion of recall and precision can be found
in [Salton 1968].) The '0' in the figure represents the
performances of the original queries, while the performances
of the new queries are indicated by numbers from 1 to 5
corresponding to the numbers assigned in figure 4(a). For
example, the '1' represents the performance of the set of
new queries defined by formula (2.1) with α = 100/r and
β = 1/s. It is found that using any of these (α,β) values,
all of the queries show some improvement over the original
queries.
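The two measures plotted in figure 4(b) can be sketched as follows. This is a minimal illustration of the definitions of recall and precision given above, averaged over a set of queries; the function names and the toy document identifiers are hypothetical and do not come from the experiments.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query.

    recall    = proportion of relevant documents that are retrieved
    precision = proportion of retrieved documents that are relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


def averaged(results):
    """Average a list of (recall, precision) pairs over the queries."""
    n = len(results)
    return (sum(r for r, _ in results) / n,
            sum(p for _, p in results) / n)


# Toy data: two queries with made-up document identifiers.
runs = [
    recall_precision(retrieved=[1, 2, 3, 4], relevant=[2, 4, 9]),
    recall_precision(retrieved=[5, 6], relevant=[6]),
]
avg_recall, avg_precision = averaged(runs)
```

A query that retrieves no relevant document has zero precision and zero recall under these definitions.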
Fig. 4(b). Performance of the new queries (region of improvement)
To illustrate that not all (α,β) values with α ≥ 0,
β ≥ 0 define "good" new queries, a number of (α,β) values are
selected in the other part of the plane, i.e. αr < βs. The
exact (α,β) values are indicated in figure 5(a), with the
performances of the corresponding queries in figure 5(b).
It is found that of the five tested (α,β) values, only one
defines better queries than the original ones. This particular
value is closer to the line αr = βs than the other
points. While the line αr = βs may not be the exact line
separating the parameters for defining "good" queries from
those defining "bad" ones, it is seen that the experimental
results obtained are consistent with the theoretical predictions
in corollary 2.2.

In the case that the original queries do not
retrieve any relevant document, it is sufficient to set the
parameter β. The set of eight original queries which do not
retrieve any relevant document are used to test which values
of β should be chosen to form the new queries. According
to corollary 2.3, β should not be set too high. A number
of β values are indicated in figure 6(a) with the performances
of the corresponding queries in figure 6(b). Since
the original queries do not retrieve any relevant document,
they have zero precision and are therefore represented by
the s-oei1s. Itus seem that with ~e—/ceworeGe = 1/55 1 7e-; 
small values of 8, the improvement is highest. 

It is suspected that the maximum improvement occurs
at the intersection of the curve defined by equation (2.16)
and the line αr = βs. A number of points on the line
αr = βs as well as points in the neighborhood of (1/r,1/s)
are tested. It is seen that the new queries with the
parameters (1/3r,1/3s) exhibit the best overall performance,
though the difference between them and the queries defined
by the parameters (1/r,1/s) is very slight. The performances
of the different queries are illustrated in figure 7.

Fig. 5(a). Points on the (α − β)-plane tested (not to scale)

Fig. 5(b). Performance of the new queries (region of deterioration)

Fig. 6(a). Points on the β-axis tested (not to scale)

Fig. 6(b). Performance of the new queries (testing for the parameter β)

Although the validity of the Cranfield relevance judgements
has been questioned [Harter 1971, Swanson 1971], it is
expected that as long as the relevant documents are
"clustered" together and the irrelevant documents are
"dispersed", similar retrieval results would be obtained
for reasonable yet different relevance judgements. This is
partially supported by our first set of experiments in which
every new query modified by the parameters of figure 4(a) is
at least as good as its corresponding original query.
Furthermore, the new query as defined by (2.1) varies according
to the relevance assessment.


2.6 Generalization


The analysis carried out in the previous sections
is based on the simple matching function. It is easily seen
that the approach can be generalized to any matching function
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is the cosine functiont used in the SMART system, equation 


(2.1) can be modified to become 


\[ Q^* = Q + \alpha \sum_{D_i \in R} \frac{D_i}{|D_i|} - \beta \sum_{D_i \in S} \frac{D_i}{|D_i|} \]

where |D_i| is the ℓ_2-norm of D_i. It then follows that

\[ |Q^*| \cos(Q^*,D) = (Q^*, D/|D|) = (Q, D/|D|) + \alpha \sum_{D_i \in R} \left( \frac{D_i}{|D_i|}, \frac{D}{|D|} \right) - \beta \sum_{D_i \in S} \left( \frac{D_i}{|D_i|}, \frac{D}{|D|} \right). \]

The approach used in the previous
section can then be carried over to the case where the cosine
function is used, if we replace X by X/|X| where X is a
document.
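The modification can be sketched in a few lines. This is only an illustration under a stated assumption: it takes the feedback formula to add α times each retrieved relevant document and subtract β times each retrieved irrelevant document, with every document first divided by its ℓ_2-norm as described above. The function names and the sample vectors are hypothetical.

```python
import math


def norm(v):
    """l2-norm of a vector given as a list of numbers."""
    return math.sqrt(sum(x * x for x in v))


def unit(v):
    """Return v / |v| (or a copy of v if v is the zero vector)."""
    n = norm(v)
    return [x / n for x in v] if n else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    na, nb = norm(a), norm(b)
    return dot(a, b) / (na * nb) if na and nb else 0.0


def feedback(query, relevant, irrelevant, alpha, beta):
    """Assumed form: Q* = Q + alpha * sum(D/|D|, relevant) - beta * sum(D/|D|, irrelevant)."""
    new_q = [float(x) for x in query]
    for d in relevant:
        for i, x in enumerate(unit(d)):
            new_q[i] += alpha * x
    for d in irrelevant:
        for i, x in enumerate(unit(d)):
            new_q[i] -= beta * x
    return new_q


q = [1, 0, 0, 1]
rel = [[1, 1, 0, 0]]   # retrieved and judged relevant
irr = [[0, 0, 1, 1]]   # retrieved and judged irrelevant
q_star = feedback(q, rel, irr, alpha=1.0, beta=0.5)
```

In this toy case the new query moves toward the relevant document and away from the irrelevant one, and |Q*| cos(Q*,D) decomposes term by term as in the identity above.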
CHAPTER 3


CLUSTERED FILES


3.1 Introduction


There are two types of search strategies (which have
actually been implemented in the SMART system): full
search and cluster search. Full search is simple: the
correlation (closeness or similarity) values between the
query and all the records are calculated. For cluster
search, the records have to be classified, by some clustering
algorithms, into a number of sets, called clusters, with a
representative constructed for each cluster. Instead of
comparing the query vector with each record, Q is correlated
with different representatives. Based on these correlation
values, the system decides which clusters are to be searched.
The data base can be arranged in a tree-like structure so
that there might be sub-clusters within a cluster and so on.
Now the calculation and subsequent ranking of records in
order of the decreasing closeness are usually both laborious
and time-consuming. As the trend now is for larger and more
diversified data bases, cluster-based retrieval is distinctly
more advantageous if the increase in efficiency does not
incur serious loss of effectiveness. The loss is due to the
fact that some of the records should have been retrieved had
full search been conducted on the whole data base, but are
"missed" simply because they are stored in the part of the data
base not searched.

The meaning of effectiveness is slightly different
from what has been defined. No attempt is made to evaluate
the relevance of the retrieved items as in chapter 2, and
the concept of relevance does not apply here. Our intuition
will have us believe that the higher the correlation value
between an item and the query, the more probable that the
item is relevant to the query. Indeed it is based on this
principle that the best-match method was devised. However,
their exact relationship will not be elaborated here. We
shall strictly adhere to the loss (as mentioned above) as
the yardstick measuring the system effectiveness. It is our
intention to do away with the human factor, i.e., relevance
judgment, and emphasize the clustered files as used in a
general context. For this reason, we will use "record"
instead of "document" throughout this chapter. Corresponding
to "relevant documents" with respect to a query Q, we
define "desired records" as {O | f(Q,O) ≥ k, where k is some
constant and O is a record}.
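The notions of desired records and of loss can be illustrated with a small sketch. It assumes that f is the simple matching function (the number of attributes two binary vectors share) and that a cluster is examined only when its representative shares at least a given number of attributes with the query; that cutoff rule, the function names and the sample data are hypothetical choices made for the illustration.

```python
def matches(a, b):
    """Simple matching function f: number of attributes two binary vectors share."""
    return sum(x & y for x, y in zip(a, b))


def desired(query, records, k):
    """Records having at least k attributes in common with the query."""
    return [r for r in records if matches(query, r) >= k]


def cluster_search(query, clusters, k, cutoff):
    """Examine only clusters whose representative matches the query in >= cutoff attributes.

    clusters is a list of (representative, records) pairs.
    """
    found = []
    for representative, records in clusters:
        if matches(query, representative) >= cutoff:
            found.extend(desired(query, records, k))
    return found


def loss(query, clusters, k, cutoff):
    """Fraction of desired records missed by cluster search relative to full search."""
    everything = [r for _, records in clusters for r in records]
    full = desired(query, everything, k)
    part = cluster_search(query, clusters, k, cutoff)
    return 1.0 - len(part) / len(full) if full else 0.0


query = (1, 1, 1, 0, 0)
clusters = [
    ((1, 1, 0, 0, 0), [(1, 1, 1, 0, 0), (1, 1, 0, 0, 1)]),
    ((0, 0, 0, 1, 1), [(0, 1, 1, 1, 1), (0, 0, 0, 1, 1)]),
]
missed = loss(query, clusters, k=2, cutoff=2)
```

Here full search finds three desired records, while cluster search skips the second cluster and misses one of them.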


3.2 Related Work


In reviewing the literature on clustered files, one
can find most of it focusing on clustering or classification
algorithms and their properties. Experiments
[Burkhard et al 1973, Salton 1971] have been carried out
on searching in cluster-based files. Generally they yield
reasonable retrieval performance. Rivest's thesis
[Rivest 1973] examines a similar problem from the theoretical
point of view and concludes that clustered search is
most efficient among all "balanced hashing functions".
However, the amount of computing time required is still very
substantial. A further reduction in computing time can be
achieved by examining only those clusters whose representatives
are sufficiently close to the query Q. In many
applications of on-line information retrieval, in particular
in document retrieval or in situations where fast response
time is required, it is sufficient to retrieve a majority
of the desired records. It is therefore of great interest
to obtain the percentage loss of desired records. With this
information, the system manager (or the user who is "on-line")
can decide on the trade-off between efficiency and
effectiveness.

There is a wide variety of retrieval methods
[van Rijsbergen 1974, Salton 1971]. Different classification
algorithms specify clusters in different ways; for
instance, some definitions of clusters permit a record to
appear in more than one cluster, while other definitions
forbid overlapping clusters. Clusters may either be
organized in a tree-structure or they may occur in only one
level. Here, a probabilistic model is constructed which
contains all the essential characteristics of clusters as
produced by a wide variety of different algorithms. It is
in this context that estimation of the number of desired
records in one cluster with respect to that of another
cluster is made, permitting an approximate percentage loss
of desired records to be obtained. In this thesis, the
variation of the ratio of the number of desired records in
one cluster to that of another under changes of different
parameters is considered. Empirical results are also
obtained, based on which guidelines are provided for setting
up the representatives of different clusters and searching
the clusters.

The framework on which Rivest's analytical work is
based differs substantially from that presented here.
Specifically, the differences are as follows: (i) He uses a
"distance" function which measures the number of attributes
contained in one vector but not the other, instead of a
"closeness" or "similarity" function. In some applications
of information retrieval, the matching rather than the mismatching
of the attributes is of importance, i.e., a
distance function may not be an accurate inverse of a similarity
function in those applications. (ii) The search algorithm
in Rivest's thesis obtains all the desired records at the
expense of more retrieval time. Thus, there is no need to
estimate the number of desired records in one set of clusters
relative to that of another in his framework. (iii) It
is assumed that the file is a randomly chosen subset of
the set of all possible records. This assumption is rather
unrealistic in many applications of information retrieval.
Bentley [1975] attacks a similar problem using a multidimensional
binary search tree. His method also obtains all
the desired records at the expense of more computing time.


3.3 A Probabilistic Model


We now state the assumptions on which our analysis is
based.
(i) The attributes in a cluster are independent, i.e. the
occurrence of an attribute (or a set of attributes) has no
relation with that of other attributes. Such an idealization
is adopted by a number of authors in different contexts
(eto. Bookstetnv1975,"Schkolnick 975,019 (Gl) 
(ii) All records are assumed (conceptually) to be n-dimensional
vectors, where n is the total number of attributes
in the set of all records, and each attribute is
either 0 or 1. However, the records may be stored by
recording only the positions of the non-zero components.

With the above assumptions, the expected number of
records in a cluster G of m records having i attributes in
common with a query Q can be computed as follows. Let ℓ ≥ i
be the number of attributes (non-zero components) of Q.
Let the attributes be denoted by z_j, 1 ≤ j ≤ ℓ. The probability
that any record of G contains z_j is g_j = m_j/m, where m_j is
the number of records in the cluster G of m records having
the j-th attribute. Similarly, the probability that a record
does not contain z_j is 1 − g_j. Using the assumptions, the
probability that a record contains z_{j_1}, ..., z_{j_i} but not
z_{j_{i+1}}, ..., z_{j_ℓ} is

\[ \prod_{k=1}^{i} g_{j_k} \prod_{k=i+1}^{\ell} (1 - g_{j_k}). \]

There are (ℓ choose i) different ways of choosing i attributes out of Q. Thus,
the probability that a record has exactly i attributes in
common with Q is

\[ \sum \prod_{k=1}^{i} g_{j_k} \prod_{k=i+1}^{\ell} (1 - g_{j_k}) \tag{3.1} \]

where Σ is summing over all the (ℓ choose i) possible choices. (3.1)
is obviously symmetric with respect to the g's. For the
sake of simplicity, (3.1) can be denoted by C(g_1,...,g_ℓ,i).
The expected number of records having k or more attributes
in common with Q is then given by m Σ_{i=k}^{ℓ} C(g_1,...,g_ℓ,i).
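The quantity C(g_1,...,g_ℓ,i) can be computed directly from its definition or, more cheaply, by adding one attribute at a time. The sketch below shows both and is meant only as an illustration under the independence assumption (i); the function names are hypothetical.

```python
from itertools import combinations
from math import prod


def c_exact(g, i):
    """Probability of exactly i attributes in common, summed over all choices of i attributes."""
    idx = range(len(g))
    total = 0.0
    for chosen in combinations(idx, i):
        inside = set(chosen)
        total += prod(g[j] if j in inside else 1.0 - g[j] for j in idx)
    return total


def c_table(g):
    """Return [C(g, 0), ..., C(g, len(g))], built one attribute at a time."""
    table = [1.0]  # with no attributes, zero matches has probability 1
    for gj in g:
        nxt = [0.0] * (len(table) + 1)
        for i, mass in enumerate(table):
            nxt[i] += mass * (1.0 - gj)   # attribute absent from the record
            nxt[i + 1] += mass * gj       # attribute present in the record
        table = nxt
    return table


def expected_at_least(g, k, m):
    """Expected number of records, out of m, with k or more attributes in common."""
    return m * sum(c_table(g)[k:])
```

The direct sum visits every choice of i attributes, whereas the table needs only on the order of ℓ² operations, which matters once ℓ is more than a handful.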
i=k i 


We now define the representative R of a given
cluster G.

Definition 3.1. Let G consist of the records {O_j}, 1 ≤ j ≤ m, where
the j-th record is given by the binary vector (O_{j1}, ...,
O_{jn}). Let (y_1, ..., y_n) be the (vector) sum of the m records,
where y_k = Σ_{j=1}^{m} O_{jk}, 1 ≤ k ≤ n. The k-th component of the
representative R is then defined to be 1 if g_k = y_k/m ≥ t (where
1 ≤ k ≤ n and t is any arbitrary number satisfying 0 ≤ t ≤ 1) and 0
otherwise.
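Definition 3.1 translates almost line by line into code. The following sketch assumes records are given as equal-length lists of 0s and 1s; the function name and the sample cluster are hypothetical.

```python
def representative(records, t):
    """Build the cluster representative of Definition 3.1.

    Component k is 1 when the fraction of records containing attribute k
    is at least the threshold t, and 0 otherwise.
    """
    m = len(records)
    n = len(records[0])
    sums = [sum(record[k] for record in records) for k in range(n)]  # vector sum
    return [1 if y / m >= t else 0 for y in sums]


cluster = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
]
rep = representative(cluster, t=0.5)
```

Raising t keeps only the attributes shared by most of the records, so the representative becomes sparser.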

By the above definition, if an attribute occurs
sufficiently often in the records of a cluster, then it will
appear in the representative. There are two reasons for
defining the representative in this manner. Firstly, the
amount of storage needed for such a representative is
minimal. Those attributes which are unlikely to occur in a
randomly selected record of the cluster (i.e. their probabilities
of occurrences are less than t) are ignored. Only
the positions of the non-zero components of the representative
are actually stored. Secondly, the computation of the
correlation of a query with a representative is efficient.
The number of comparisons needed to find the number of
attributes in common is at most equal to the sum of the
numbers of the non-zero components of the two vectors, if
the positions of the non-zero components of each vector are
stored in ascending or descending order.
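The bound on the number of comparisons comes from a single merge-style pass over the two sorted position lists. A minimal sketch, assuming both lists are in ascending order and counting each three-way comparison of a pair of positions as one comparison (the function name is hypothetical):

```python
def common_attributes(a, b):
    """Count positions present in both ascending lists a and b.

    Returns (number in common, number of comparisons made). Every comparison
    advances at least one of the two indices, so the number of comparisons
    never exceeds len(a) + len(b).
    """
    i = j = common = comparisons = 0
    while i < len(a) and j < len(b):
        comparisons += 1
        if a[i] == b[j]:
            common += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common, comparisons


query_positions = [2, 5, 9, 14]
rep_positions = [1, 5, 6, 14, 20]
result = common_attributes(query_positions, rep_positions)
```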

As mentioned in Section 3.1, the user's query is
compared with the different representatives and the correlation
decides which clusters will be examined. Thus, it
is necessary to relate the correlation of the query and the
representative of a cluster with the number of records in
the cluster having a fixed number of attributes in common
with the query. Let the probabilities of occurrences of
the ℓ attributes of Q in a randomly selected record of a
cluster G be (g_1,g_2,...,g_ℓ) and s be the number of attributes
in common between Q and R, the representative of G. By
the definition of a representative, s of the g's are greater
than or equal to t and the other (ℓ−s) g's are less than t.
Let us denote the probabilities by (p_1,...,p_s,q_{s+1},...,q_ℓ)
where each p ≥ t and each q < t. Thus, by our earlier discussion,
the expected number of records in G having k or more attributes
in common with Q is m Σ_{i=k}^{ℓ} C(p_1,...,p_s,q_{s+1},...,q_ℓ,i).
This expectation value is conditional on p_i, 1 ≤ i ≤ s, and q_j,
s+1 ≤ j ≤ ℓ. However, because of the way a representative is
set up, the values of the p's and q's are unknown; at least
this information cannot be readily obtained just by inspecting
the representative. Thus, in the present context, it is
more appropriate to consider the average behavior of clusters
having the same characteristics (i.e. with their representatives
having the same number of attributes in common with Q, the
same number of records, and the same threshold t in the
definition of the representative) than to examine the
behavior of an individual cluster. By average is meant
that the p's are allowed to vary independently from t to 1,
and their distributions (which are not known) are identical.
Similarly, the q's are assumed to be independent and identically
distributed between 0 and t, though the distributions
of a p and a q will be different. Thus, the "unconditional
expected number" (as opposed to "conditional expected
number" as introduced earlier) of records in G having k or
more attributes in common with Q is
m E_{(p,q)}(Σ_{i=k}^{ℓ} C(p_1,...,p_s,q_{s+1},...,q_ℓ,i)),
where E is expected value over the independent
random variables p's and q's, and m is the number of
records in G.
By probability theory [Feller 1967],
E(x_1+x_2)=E(x_1)+E(x_2) and, if x_1 and x_2 are independent,
E(x_1 x_2)=E(x_1)E(x_2). The above expression can then be
reduced to m Σ_{i=k}^{ℓ} C(E(p),...,E(p),E(q),...,E(q),i), where E(p)
is the expected value of each p_i, 1 ≤ i ≤ s, and E(q) is the
expected value of each q_i, s+1 ≤ i ≤ ℓ. In order to further
simplify our notation, we shall represent
Σ_{i=k}^{ℓ} C(E(p),...,E(p),E(q),...,E(q),i) by Σ_{i=k}^{ℓ} C(ℓ,s,i), where s
indicates the number of E(p)'s; i.e.,

\[ C(\ell,s,i) = C(\underbrace{E(p),\ldots,E(p)}_{s},\underbrace{E(q),\ldots,E(q)}_{\ell-s},i). \tag{3.2} \]

We compute each C(ℓ,s,i), k ≤ i ≤ ℓ, according to the definition
of C: since there are exactly i attributes in common between
the query and a desired record, there are ℓ−i attributes not
in common. j of these ℓ−i attributes (0 ≤ j ≤ ℓ−i) can be chosen
out of the ℓ−s (1−E(q))'s and the remaining (ℓ−i−j) chosen from
the s (1−E(p))'s. Summing j from 0 to ℓ−i,

\[ C(\ell,s,i) = \sum_{j=0}^{\ell-i} \binom{\ell-s}{j} (1-E(q))^{j} E(q)^{\ell-s-j} \binom{s}{\ell-i-j} (1-E(p))^{\ell-i-j} E(p)^{s-(\ell-i-j)} \tag{3.3} \]

where some of the terms in the sum may be zero.

†Owing to the absence of knowledge of the exact values of the
p's and the q's, uncertainties are introduced into the
expression m C(p_1,...,p_s,q_{s+1},...,q_ℓ,i), the expected number
of records having i attributes in common with the query.
Because of the independence assumption, the expression
m E(Σ_{i=k}^{ℓ} C(p_1,...,p_s,q_{s+1},...,q_ℓ,i)) depends only on the
expected values of the p's and the q's (see next paragraph for
explanation). Hence, the term "unconditional expected number".
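The counting argument above can be sketched as a short function. It follows the description in the text: j of the absent attributes come from the ℓ−s attributes with probability E(q), and the remaining ℓ−i−j from the s attributes with probability E(p). The parameter names ep and eq (standing for E(p) and E(q)) and the function name are hypothetical.

```python
from math import comb


def c_lsi(l, s, i, ep, eq):
    """C(l, s, i): probability of exactly i matches among l query attributes,
    s of which occur with probability ep and l - s with probability eq."""
    total = 0.0
    for j in range(l - i + 1):
        absent_p = l - i - j  # absent attributes drawn from the s "p" attributes
        if j > l - s or absent_p > s:
            continue          # these are the terms of the sum that are zero
        total += (comb(l - s, j) * (1.0 - eq) ** j * eq ** (l - s - j)
                  * comb(s, absent_p) * (1.0 - ep) ** absent_p * ep ** (s - absent_p))
    return total
```

Summing c_lsi over i from 0 to ℓ gives 1, which is a convenient sanity check when experimenting with parameter values.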

Thus we have found a relation between the correlation
of a query with the representative of a cluster and the
unconditional expected number of records having k or more
attributes in common with the query. In the next section,
we shall compare the average behavior of a cluster relative
to that of another under variations of different parameters.


3.4 Analysis 


Consider the case where there are only two clusters
G_1 and G_2. Let Q be the query submitted and

ℓ = the number of attributes of Q.

The retrieval depends on the threshold k, namely,

k = the minimal number of attributes which a record
    should possess in common with the query in
    order to be retrieved.

Suppose now it is found that the representative of G_1
has fewer attributes in common with Q than that of G_2 (i.e.,
f(Q,R_1) < f(Q,R_2)). The purpose of this analysis is to examine the effect
on retrieval performance if cluster G_1 is not retrieved.
The main criterion of accepting or rejecting G_1 depends on
W, the ratio of the expected number of desired records in
G_1 to that in G_2. If W is sufficiently small, then the
amount of desired records in G_1 is likely to be scanty
compared with that of G_2. Thus the retrieval of cluster G_1
may not be worthwhile in terms of the time required to
examine all the records in G_1.
G_1 and G_2 will henceforth be referred to as "average"
clusters, which are defined in the last section, for clusters
with same values for

s = the number of attributes that R_1 has in common
    with Q, namely, f(Q,R_1),
s+d = the number of attributes that R_2 has in common
    with Q, namely, f(Q,R_2), where d is an integer > 0,
and
t = the threshold value for determining whether an
    attribute should be contained in the representatives.

If N_1 and N_2 are the total number of records
contained in G_1 and G_2 respectively, then

\[ W = N_1 \sum_{i=k}^{\ell} C(\ell,s,i) \Big/ N_2 \sum_{i=k}^{\ell} C(\ell,s+d,i). \]

However, since N_1 and N_2 are fixed relative
to the five parameters, N_1/N_2 will be disregarded in this
analysis. Thus, the ratio becomes

\[ W = \sum_{i=k}^{\ell} C(\ell,s,i) \Big/ \sum_{i=k}^{\ell} C(\ell,s+d,i). \tag{3.4} \]
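A small numerical sketch of the ratio W may help to see how it behaves. The thesis leaves the distributions of the p's and q's unspecified; purely for illustration, the example assumes they are uniform on [t,1] and [0,t), so that E(p) = (1+t)/2 and E(q) = t/2. The function names and the parameter values are hypothetical.

```python
def match_distribution(probs):
    """Probabilities of 0, 1, ..., len(probs) matches for independent attributes."""
    table = [1.0]
    for g in probs:
        nxt = [0.0] * (len(table) + 1)
        for i, mass in enumerate(table):
            nxt[i] += mass * (1.0 - g)
            nxt[i + 1] += mass * g
        table = nxt
    return table


def tail(l, s, k, ep, eq):
    """Sum over i >= k of C(l, s, i), with s attributes at ep and l - s at eq."""
    return sum(match_distribution([ep] * s + [eq] * (l - s))[k:])


def w_ratio(l, s, d, k, ep, eq):
    """W: ratio of the tail sums for a cluster matching in s versus s + d attributes."""
    return tail(l, s, k, ep, eq) / tail(l, s + d, k, ep, eq)


t = 0.5
ep, eq = (1 + t) / 2, t / 2  # assumption: uniform p on [t, 1] and q on [0, t)
ratios = [w_ratio(10, 3, d, 4, ep, eq) for d in (1, 2, 3)]
```

Under these assumed values the ratio falls as d grows, so a cluster whose representative matches the query in fewer attributes is expected to hold a shrinking share of the desired records.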


With all five parameters given, W can be calculated. However,
it is desirable to derive expressions which show how W
depends on each parameter while some or all others are
fixed. In fact, based on such results, it would be possible
to indicate regions of the parameter space in which W takes
on a small value and hence the rejection of G_1 is justified.
Hopefully, the analysis can also lead to a better understanding
of the cluster-based retrieval process, resulting in
more effective implementation of classification and retrieval
algorithms.


To start with the analysis, we express W as

\[ W = \prod_{j=1}^{d} a_j \tag{3.5} \]

where

\[ a_j = \sum_{i=k}^{\ell} C(\ell,s+j-1,i) \Big/ \sum_{i=k}^{\ell} C(\ell,s+j,i). \tag{3.6} \]
=k He) 


THe fact that ead and ae Isj=0d, witli be shown 
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as immediate consequences of Lemmas 3.land 3.5 respectively. 


ae W (oa) e As a result, when the difference of 
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the correlations of the query with the two representatives 


Thus (a 


increases, the average number of desired records in one
cluster (G_1) relative to that of the other (G_2) decreases
rapidly. Let W_0 be the value of W such that an ordinary
user is likely to consider the percentage of desired records
he is likely to find in G_1 to be too low for its retrieval
to be worthwhile. If d_0 is the value of d which produces
W_0, then any value d ≥ d_0 implies a value W ≤ W_0 and will cause
clusters are involved, the one which is expected to contain
the largest number of desired records will be selected to
compare with each of the remaining clusters. The choice of
W_0 depends on the number of clusters involved, in the sense
that the value will be adjusted so as to maintain an adequate
number of records to be retrieved. This observation applies
throughout this section.
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Lemma 3.1: Eines Ig AAG C(I gm ++ 1G yoko) 
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+ >} C(Gon ++ 1S yr5) 

Ja5 
‘Blower Sieeaa ele 
Proort *aeby) derinition, we have 
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B k=l Jk k=i+l Ik 
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Line (otek fey J, Can appear as a oh (i.e.,the probability that 


k 
the tarst attribute is in common) on as a ace ) Cave.; the 
k 
probability that the attribute is mot in common). If it 


appears as ag. , then out of the remaining (%-1) attributes, 


k 
we have to choose (i-1) on Wa to: ensure that, elere wares 
k 
attributes in common. On the other hand,if J, appears as 
gmtl—-de)), then. the i attributes in common are chosen from 
Jk 
the remaining (%-1) attributes. THUS; Liew 0 Way as 
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When i is summed from k to 2-lin (S28), the coetrti1— 
cients of Sy cancel each other except the first one 
C(gy +++ 15, k-1) and the last one C(Gy r++ 1F,,h-1). We. can 


then add C19 2-9 71Ty  B)=G,C(G on. = 2495 k-1) to this equation, 


anaetae cesired result fol Lows. 
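The identity of Lemma 3.1 can be checked numerically. The sketch below is only an illustration under the model's independence assumption: the helper names and the probabilities in `gs` are arbitrary choices, not values from the thesis. It evaluates both sides of the lemma for every k from 1 to $\ell$.

```python
def common_attribute_distribution(gs):
    """Return [C(g_1..g_l, i) for i = 0..l] for independent attributes."""
    coeffs = [1.0]
    for g in gs:
        nxt = [0.0] * (len(coeffs) + 1)
        for i, c in enumerate(coeffs):
            nxt[i] += c * (1.0 - g)   # attribute not in common
            nxt[i + 1] += c * g       # attribute in common
        coeffs = nxt
    return coeffs


def lemma_3_1_sides(gs, k):
    """Return (left side, right side) of Lemma 3.1 for 1 <= k <= len(gs)."""
    full = common_attribute_distribution(gs)       # indices 0..l
    rest = common_attribute_distribution(gs[1:])   # indices 0..l-1
    lhs = sum(full[k:])
    rhs = gs[0] * rest[k - 1] + sum(rest[k:])
    return lhs, rhs


# Arbitrary illustrative probabilities g_1..g_5.
gs = [0.7, 0.2, 0.55, 0.05, 0.4]
for k in range(1, len(gs) + 1):
    lhs, rhs = lemma_3_1_sides(gs, k)
    print(k, round(lhs, 6), round(rhs, 6))
```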


By Lemma 3.1, (3.2) and (3.6),

$$\alpha_j = \frac{\sum_{i=k}^{\ell} C(\ell,s+j-1,i)}{\sum_{i=k}^{\ell} C(\ell,s+j,i)} = \frac{E(q)\,C(\ell-1,s+j-1,k-1) + \sum_{i=k}^{\ell-1} C(\ell-1,s+j-1,i)}{E(p)\,C(\ell-1,s+j-1,k-1) + \sum_{i=k}^{\ell-1} C(\ell-1,s+j-1,i)} < 1,$$

since $E(q) < E(p)$. This leads to the following lemma:

Lemma 3.2: W decreases as d increases.

Proof: By (3.5) and the fact that each $\alpha_j < 1$, $1 \le j \le d$.
We now show that as k increases (i.e., the records retrieved will be more "similar" to the query), W decreases. This result implies that when closer neighbors are required, proportionally fewer desired records will be expected to appear in $C_1$ relative to $C_2$. In other words, if fewer records are retrieved, the percentage loss of desired records (if $C_1$ is not retrieved) will decrease. Let $k_0$ be the value of k producing $W_0$ (its definition appears earlier in this section); then for a user specifying a threshold $k \ge k_0$, it is rather safe not to retrieve $C_1$.


Lemma 3.3: W decreases as k increases.

Proof: Let $B_i = C(\ell,s,i)$ and $A_i = C(\ell,s+d,i)$. It suffices to show

$$\sum_{i=k}^{\ell} B_i \Big/ \sum_{i=k}^{\ell} A_i \;\ge\; \sum_{i=k+1}^{\ell} B_i \Big/ \sum_{i=k+1}^{\ell} A_i. \tag{3.9}$$

By simple inequality manipulation, it can be shown that (3.9) reduces to [...] By Lemma A4 of Appendix II, the desired result follows immediately.
We now relate W to the other three parameters, i.e., s, $\ell$ and t. Again, we shall examine the behavior of W when [...]

where $x = E(q)$, $y = E(p)$ and $r_i$ is either E(p) [...] needs to be expressed in another [...] (3.11) [...] can be considered as a function of x, y and the [...]

Because of the symmetry of $\alpha_j$ with respect to the r's, [...]

But (3.13) is equivalent to (3.10) where d=1 and s and k are decremented by 1.
Lemma 3.5: W increases as s increases.

Proof: When s increases by 1, one of the r's in each $\alpha_j$ of the form (3.11), one whose value is E(q), is increased to E(p). [...] since, by Lemma 3.4, each $\partial\alpha_j/\partial r_i \ge 0$.


This lemma has the following implications. Suppose Q' is another query which also has $\ell$ attributes but a lower correlation (i.e., s decreases) with the two representatives than the original query Q. Assume also that the difference in correlations between the two representatives in relation to Q' is the same as that of Q (i.e., d is fixed). Then the proportion of the expected number of desired records in $C_1$ to that of $C_2$ with respect to Q' is lower than that of Q. Thus, Q' is expected to have higher performance with regard to $C_1$ as compared with $C_2$ than is Q. The fact that $\alpha_j < \alpha_{j+1}$, $1 \le j < d$, also follows immediately from Lemma 3.5.


Lemma 3.6: W increases as $\ell$ increases.

Proof: By Lemma 3.1, [...] When $\ell$ increases by 1 (without increasing s), the new ratio [...] which is then equal to the old ratio. The inequality holds since, by Lemma 3.4, $\partial\alpha_j/\partial r_i \ge 0$ (as $r_i$ increases from 0 to E(q)).
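The monotonic behavior stated in Lemmas 3.2, 3.3, 3.5 and 3.6 can be spot-checked numerically. The sketch below is an illustration only: it assumes independent attributes, takes E(p)=(1+t)/2 and E(q)=t/2, and uses a hypothetical base setting ($\ell$=8, s=2, d=2, k=4, t=0.1) that does not come from the thesis. It varies one parameter at a time and tests whether W moves in the stated direction.

```python
def common_attribute_distribution(gs):
    """Return [C(g_1..g_l, i) for i = 0..l] for independent attributes."""
    coeffs = [1.0]
    for g in gs:
        nxt = [0.0] * (len(coeffs) + 1)
        for i, c in enumerate(coeffs):
            nxt[i] += c * (1.0 - g)
            nxt[i + 1] += c * g
        coeffs = nxt
    return coeffs


def ratio_w(ell, s, d, k, t):
    """W of (3.4), assuming E(p) = (1 + t) / 2 and E(q) = t / 2."""
    e_p, e_q = (1 + t) / 2, t / 2

    def tail(n_p):
        # n_p attributes with probability E(p), the rest with E(q)
        gs = [e_p] * n_p + [e_q] * (ell - n_p)
        return sum(common_attribute_distribution(gs)[k:])

    return tail(s) / tail(s + d)


def is_nonincreasing(values, tol=1e-12):
    return all(b <= a + tol for a, b in zip(values, values[1:]))


def is_nondecreasing(values, tol=1e-12):
    return all(b >= a - tol for a, b in zip(values, values[1:]))


# Hypothetical base setting; one parameter is varied at a time.
ELL, S, D, K, T = 8, 2, 2, 4, 0.1
w_by_d = [ratio_w(ELL, S, d, K, T) for d in range(1, ELL - S + 1)]
w_by_k = [ratio_w(ELL, S, D, k, T) for k in range(1, ELL + 1)]
w_by_s = [ratio_w(ELL, s, D, K, T) for s in range(0, ELL - D + 1)]
w_by_ell = [ratio_w(ell, S, D, K, T) for ell in range(K, 13)]

print("decreases with d:", is_nonincreasing(w_by_d))
print("decreases with k:", is_nonincreasing(w_by_k))
print("increases with s:", is_nondecreasing(w_by_s))
print("increases with l:", is_nondecreasing(w_by_ell))
```

The comparisons are non-strict because equality can occur at the boundary k = $\ell$, where every $\alpha_j$ reduces to E(q)/E(p).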

This result can be interpreted as follows. For a query Q'' which has more attributes than the original query Q (i.e., $\ell$ increases) but whose correlations with the representatives are the same as those of Q (i.e., s and d are fixed), the proportion of the number of desired records in $C_1$ to that of $C_2$ with respect to Q'' is higher than that [...] as compared to $C_2$.

Finally, we deal with the parameter t. As t increases, E(p) and E(q) are expected to increase, which in turn affects the value of $\alpha_j$ (by (3.11)), and hence W. It will be shown that dW/dt > 0. For clusters employing a higher threshold t in defining their representatives while keeping the other parameters fixed with query Q, the ratio W increases. This means roughly that if the process of choosing attributes in the representative is more selective, W increases. However, there is a limitation: all other parameters must remain unchanged. As t increases for a given set of clusters, some of the p's may be below the new threshold, causing them to become q's. Thus, with respect to a given query, the parameters s and d may be altered; s
is expected to decrease, but the behavior of d is unpredictable. Therefore, it is not possible to analyse the behavior of the ratio W when t changes in a given set of clusters having the same d, s, $\ell$ and k but different t.

Since $\alpha_j > 0$, $1 \le j \le d$, it is sufficient to show that $d\alpha_j/dt > 0$.

Lemma 3.7: If $d(E(q))/dt \ge d(E(p))/dt$,* then $d\alpha_j/dt > 0$.

*See Remark 1 right after this lemma for explanation.

Proof: [...] (3.14)

It is sufficient to show that each of the two terms [...] since $E(p) \ge E(q)$ and $d(E(q))/dt \ge d(E(p))/dt$. By Lemma 2.4, [...]

Remark 1: If the distributions of the p's and the q's are such that their means occur in the middle of the ranges, then E(q)=t/2 and E(p)=(1+t)/2. Thus d(E(q))/dt = d(E(p))/dt = 1/2 and the hypothesis of the lemma is satisfied. One such distribution is the uniform distribution.

The actual behavior of W will be explored in the next section by assuming a set of values for the five parameters. The choice of these values is dictated by the results obtained in this section.
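The dependence of W on t under the assumption of Remark 1 can be illustrated numerically. The sketch below takes E(q)=t/2 and E(p)=(1+t)/2 and holds the other four parameters fixed at hypothetical values ($\ell$=4, s=1, d=2, k=2) that are not taken from the thesis. The threshold grid is likewise an assumption chosen for illustration.

```python
def common_attribute_distribution(gs):
    """Return [C(g_1..g_l, i) for i = 0..l] for independent attributes."""
    coeffs = [1.0]
    for g in gs:
        nxt = [0.0] * (len(coeffs) + 1)
        for i, c in enumerate(coeffs):
            nxt[i] += c * (1.0 - g)
            nxt[i + 1] += c * g
        coeffs = nxt
    return coeffs


def ratio_w(ell, s, d, k, t):
    """W of (3.4) with E(p) = (1 + t) / 2 and E(q) = t / 2 (Remark 1)."""
    e_p, e_q = (1 + t) / 2, t / 2

    def tail(n_p):
        gs = [e_p] * n_p + [e_q] * (ell - n_p)
        return sum(common_attribute_distribution(gs)[k:])

    return tail(s) / tail(s + d)


# Hypothetical fixed parameters and threshold grid.
thresholds = [0.05, 0.10, 0.15, 0.20]
w_by_t = [ratio_w(4, 1, 2, 2, t) for t in thresholds]
for t, w in zip(thresholds, w_by_t):
    print(t, round(w, 4))
```

For this setting the printed ratios grow with t, which is the direction stated for dW/dt when all other parameters are held unchanged.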


3.5 Empirical Results 


In this section, the values of $\alpha_1$, as defined in (3.5) and (3.6) of section 3.4, are obtained for various values of $\ell$, k, and s. $\alpha_1$ is important because $W \ge (\alpha_1)^d$, as shown in section 3.4. As the behavior of the ratio W (and therefore $\alpha_1$) with respect to the different parameters is such as described in section 3.4, it is sufficient to obtain the values of $\alpha_1$ at certain discrete points of the parameters. Its values at other points can be interpolated from the given points. E(p) and E(q) are assumed to be (1+t)/2 and t/2 respectively. In Tables 1-4, the values of $\alpha_1$ and the minimum value of d such that the rejection of the cluster $C_1$ is likely to be acceptable to a user are presented for $0.0 < t \le 0.2$, $k/4 \le s \le 3k/4$ and $\ell \ge 4$. It can be reasonably assumed that a user is likely to consider it to be more advantageous not to retrieve $C_1$ when the concentration of desired records in $C_1$ to that of $C_2$ is equal to or less than $W_0$. Here $W_0$ is arbitrarily assigned a low value of 10%. When s is close to k, quite a few records in $C_1$ are likely to meet the user's criteria for retrieval. Its rejection will thus lead to unsatisfactory results. Hence, the values of $\alpha_1$ are not shown for s>3k/4. If t is chosen to be greater than 0.2, then attributes which have rather high probability of occurrences in the cluster, though not as high as t, are removed from the representative, and retrieval performance will substantially deteriorate. As a consequence, we examine the case $0.0 < t \le 0.2$ only. The results are presented as follows.


(i) For retrieving very close neighbors, i.e., $k = 3\ell/4$, a small difference in correlations between the query and the representatives, $d \ge 2$, is sufficient to bring W below $W_0$ for $0.0 < t \le 0.1$, as illustrated by Tables 1, 2, 3 and 4(c). By (3.5) of section 3.4, as d increases linearly, W decreases more or less exponentially. Thus, for $d \ge 2$, rejecting cluster $C_1$ will result in the loss of very few desired records in comparison to those retrieved in $C_2$. On the other hand, when the user is not very selective (i.e., k is not large compared with $\ell$, such as $k = \ell/4$), the representative $R_2$ of $C_2$ has to be much closer to the query than $R_1$ of $C_1$ [...] in order that the rejection of $C_1$ is justified. In some cases (e.g., some entries in Tables 1-4(a)), even when all the attributes of the query are included in $R_2$, W is still well above $W_0$. Under these circumstances, the rejection of $C_1$ is surely unacceptable.

As predicted by the results of the last section, the values of d needed to bring W down to $W_0$ when $k = \ell/2$ are between those obtained for $k = \ell/4$ and $k = 3\ell/4$, as illustrated by the entries of Tables 1-4(b) compared to 1-4(a) and (c).


(ii) When the representative of $C_1$ is not too close to the query compared with the threshold value k, e.g., $s \le k/4$, the majority of the records in $C_1$ will likely be considered undesirable by the user. In this case, for a medium value of k, e.g., $k \ge \ell/2$, a small value for d, d = 3, is sufficient to bring W below $W_0$ for $0.0 < t \le 0.1$, as shown in the leftmost columns of Tables 1, 2, 3, 4(b,c). However, as s goes up to 3k/4, the situation is similar to that when the threshold value k is low, as described in (i). In some cases (e.g., some entries in the rightmost columns of Tables 1-2(a,b)), no value of d can make W sufficiently small.


(iii) If the length of the query increases, it is expected that the user's retrieval criterion will be raised in terms of the threshold value k and the representative will have more attributes in common with the query. Suppose now $k = c_1 \ell$ and $s = c_2 k$ for some constants $c_1$ and $c_2$ ($c_1, c_2 \in \{1/4, 1/2, 3/4\}$ respectively in the tables). As $\ell$ increases, most of the $\alpha$'s ($\alpha_i$, for $i \ge 2$, are not shown in the tables) remain unchanged or decrease (compare Tables 1-4 with 5). As a consequence, the d's do not have to increase to maintain $W \le W_0$. The results of (i) and (ii) are therefore applicable to queries having at least four attributes, i.e., $\ell \ge 4$.


(iv) Let $t_1$ and $t_2$ be two thresholds such that $t_2 = c\,t_1$ for some constant c (in the tables, c = 2, 3 and 4 when $t_1 = 0.05$; and c = 3/2, 2 for $t_1 = 0.1$, etc.). Then from Tables 1-5, $\alpha_1^{(2)} < c\,\alpha_1^{(1)}$, where $\alpha_1^{(1)}$ and $\alpha_1^{(2)}$ are the values of $\alpha_1$ corresponding to the thresholds $t_1$ and $t_2$ respectively. In other words, the growth rate of $\alpha_1$ with respect to t is less than linear. Thus, if $W_i$ is the ratio corresponding to the threshold $t_i$, i=1,2, then $W_2 < c^d W_1$. This would allow us to estimate the retrieval performance for clusters using some threshold t which is not tabulated.


Summarizing the results, the following conclusion can be reached. Clustered search yields excellent retrieval performance for on-line search when only a few records are required (i.e., a high value for k). As more records are desired, retrieval performance will deteriorate. The concentration of desired records in a cluster whose representative has small correlation with the query (s=k/4) is low relative to that of another cluster whose representative has a relatively higher correlation with the query. Thus the rejection of the former cluster is acceptable to most users. An approximate estimate of the ratio W is also given when t varies. In view of the results tabulated, t>0.2 is likely to be unacceptable.
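The search for the minimum d that brings W down to $W_0$ can be sketched as follows. This is an illustration under the assumptions of this section (E(p)=(1+t)/2, E(q)=t/2, $W_0$ = 10%). The function names and the two example settings are hypothetical, and the values produced are not claimed to reproduce the entries of Tables 1-5.

```python
def common_attribute_distribution(gs):
    """Return [C(g_1..g_l, i) for i = 0..l] for independent attributes."""
    coeffs = [1.0]
    for g in gs:
        nxt = [0.0] * (len(coeffs) + 1)
        for i, c in enumerate(coeffs):
            nxt[i] += c * (1.0 - g)
            nxt[i + 1] += c * g
        coeffs = nxt
    return coeffs


def ratio_w(ell, s, d, k, t):
    """W of (3.4) with E(p) = (1 + t) / 2 and E(q) = t / 2."""
    e_p, e_q = (1 + t) / 2, t / 2

    def tail(n_p):
        gs = [e_p] * n_p + [e_q] * (ell - n_p)
        return sum(common_attribute_distribution(gs)[k:])

    return tail(s) / tail(s + d)


def minimum_d(ell, s, k, t, w0=0.10):
    """Smallest d with s + d <= ell such that W <= w0, or None if even a
    representative containing every query attribute leaves W above w0."""
    for d in range(1, ell - s + 1):
        if ratio_w(ell, s, d, k, t) <= w0:
            return d
    return None


# Hypothetical settings with ell = 4, s = 1, t = 0.1.
print("k = 3:", minimum_d(4, 1, 3, 0.1))   # selective user, k = 3 * ell / 4
print("k = 2:", minimum_d(4, 1, 2, 0.1))   # less selective user, k = ell / 2
```

For the selective setting a small d suffices, whereas for the less selective one no admissible d brings W below 10%, which mirrors the qualitative observations in (i).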


3.6 Conclusion


The analytical results presented in this chapter depend on only a few essential assumptions. There are some limitations of the results and they should be carefully interpreted. Nevertheless, the model does represent a general approach to analysing this information retrieval process. Further results can be developed by imposing more dependence relationships among the five parameters. However, the combinatorics involved might prove to be difficult to handle. In this respect, the simulation as described in the last section presents some interesting observations.


CHAPTER 4


COMPARISON OF MODELS


4.1 Distribution of Similarities


Despite the apparent dissimilarities between the models described in chapters 2 and 3, their basic principles actually do not differ very much. Most of the assumptions are made for the purpose of estimating the distribution of similarities (as defined in chapter 1) between the records and the particular query in question. Why is the distribution of similarities important? Given this distribution and the threshold, the percentage of records that are "desired" (or, as in the feedback process, the relevant and irrelevant documents retrieved) can be calculated. This quantity provides the basis on which the retrieval system is evaluated. If an additional process is imposed on the system, this distribution of similarities will be altered accordingly, thereby providing a different measure of system effectiveness. Relevance feedback (RF) is such a process. The distributions of similarities between the query and the relevant as well as the irrelevant documents are assumed to be normal and characterised by the means ($\mu$'s) and the standard deviations ($\sigma$'s). Mathematically, RF is a mapping of these $\mu$'s and $\sigma$'s to $\mu^*$'s and $\sigma^*$'s. The threshold T is not related to the distributions of similarities here. (In chapter 2, T is transformed to $T^*$ just to make sure the same proportion of documents is retrieved each time, for the sake of comparison only.) In the process of retrieval in clustered files (RCF), this distribution is not explicitly specified. Instead, it is derived from the assumptions at a lower level: the attribute level. (This is why the model adopted here is called "microscopic" as compared to the one for RF.) From the assumption that the occurrence of one attribute is independent of the occurrence of any other attribute, the values $C(g_1,\dots,g_\ell,i)$, $0 \le i \le \ell$, are derived. The distribution of similarities for RCF is composed of these values.
As an analog to the random variables defined for RF, we shall demonstrate how the similarity in RCF can be defined as a random variable. Let us first define a one-dimensional binomial distribution $X_i$ for the ith attribute contained in the query:

$$\mathrm{Prob}(X_i = 1) = g_i,$$

$$\mathrm{Prob}(X_i = 0) = 1 - g_i,$$

recalling that $g_i$ is the probability that a record in the cluster contains the ith attribute. The similarity between a record in the data base and the query is then the random variable X, where $X = \sum_{i=1}^{\ell} X_i$. The probability generating function for X (in z) is $\prod_{i=1}^{\ell}(1 - g_i + g_i z)$. If this function is expanded into a polynomial in z, say $\sum_{i=0}^{\ell} a_i z^i$, then it can be
easily shown that $a_i = C(g_1,\dots,g_\ell,i)$, $0 \le i \le \ell$.

Unlike its counterparts in RF, X does not conform to any well-known distribution. Because of the number of independent variables (the g's) in the C-function, the properties of this distribution are largely unknown. However, studies of the simulation results indicate that it is close to a Poisson distribution when the sum of the g's is small, and gradually evolves to a normal distribution as this sum increases. It can also be shown that it has either a single maximum or two (equal) maxima in adjacent positions. In this sense, it behaves very much like a binomial distribution, and indeed, it is one when all the g's are equal.
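The expansion of the generating function can be carried out mechanically, one factor at a time. The following sketch is an illustration only: the function names and the probabilities used are arbitrary assumptions. It multiplies out the product of the factors $(1 - g_i + g_i z)$, then checks the single-peak shape and the reduction to a binomial distribution when all the g's are equal.

```python
from math import comb


def pgf_coefficients(gs):
    """Coefficients a_0..a_l of prod_i (1 - g_i + g_i * z), so that
    a_i = C(g_1..g_l, i) is the probability of exactly i common attributes."""
    coeffs = [1.0]
    for g in gs:
        nxt = [0.0] * (len(coeffs) + 1)
        for i, c in enumerate(coeffs):
            nxt[i] += c * (1.0 - g)   # constant term of the factor
            nxt[i + 1] += c * g       # z term of the factor
        coeffs = nxt
    return coeffs


def is_unimodal(pmf, tol=1e-15):
    """True if pmf rises (weakly) to a peak and then falls (weakly)."""
    i, n = 0, len(pmf)
    while i + 1 < n and pmf[i + 1] >= pmf[i] - tol:
        i += 1
    while i + 1 < n and pmf[i + 1] <= pmf[i] + tol:
        i += 1
    return i == n - 1


# Arbitrary illustrative probabilities g_1..g_5.
gs = [0.1, 0.6, 0.3, 0.8, 0.25]
pmf = pgf_coefficients(gs)
print([round(a, 4) for a in pmf], is_unimodal(pmf))

# Equal g's: the coefficients coincide with a binomial distribution.
g, ell = 0.3, 5
equal_case = pgf_coefficients([g] * ell)
binomial = [comb(ell, i) * g**i * (1 - g) ** (ell - i) for i in range(ell + 1)]
```

The mean of the resulting distribution is the sum of the g's, which is the quantity governing the Poisson-like or normal-like shape noted above.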


4.2 Choice of Model


The microscopic model is often applied to processes in which an attribute can be distinguished from the other attributes. It is not so obvious why the microscopic model, rather than the macroscopic one, is adopted for RCF. There, unlike processes such as indexing, the distribution of each attribute is not explicitly involved and it seems that only clusters need to be differentiated. The reason for adopting such a model lies in the very nature of the process: the relationship among the query, the representative and the cluster. The system always "prefers" searching the cluster whose representative is "closer" to the query. In order that the process is worthwhile, the "preferred" cluster should yield on the average more desired records than the non-preferred one. (Otherwise, the clusters are not properly organized, a pathological case which is not considered by the analysis.) The outright assumptions on the distribution of similarities between the records in a cluster and the query (as in RF) are not appropriate here, as the representative would then be left out. Transitivity of some sort in the form of query - representative - cluster has to be established. That is, the correlation of the query and the representative, together with the relationship between the representative and the records in the cluster, should provide the distribution of similarities between the query and the records in the cluster. One solution to this problem, as is presented in chapter 3, is to string these three things together by means of the frequencies of occurrences of attributes in the clusters. Only high frequency attributes can be included in the representative. As the representative of the preferred cluster, according to the search algorithm, has more attributes in common with the query, the preferred cluster will contain more attributes which are present in the query and are among those attributes occurring most frequently in the cluster. Therefore, the preferred cluster should contain a higher percentage of desired records on the average. Indeed, this argument is true, as shown in section 3.4.

Conceivably, one can analyse RF by means of the microscopic model: building up the distributions of similarities from the assumptions (i) and (ii) in chapter 3. Here, the complexity of mathematics seems to be the decisive factor for choosing the macroscopic model instead, as the next section demonstrates.


4.3 Mathematical Manipulation


Comparing the mathematics used in each model and the results derived, one must admit that the mathematics used for RF is more comprehensible and the results are generally more elegant. In fact, at one point in working with RCF, the mathematics got almost out of hand! The complexity in dealing with RCF is due to the discrete quantities involved and the large number of variables.

To simplify the analysis, the records in RCF are represented by binary vectors. Were a real number allowed for each component of the n-tuple record, more variables would be needed to describe the distribution of the weight of each attribute within a record. Besides, some more assumptions would have to be added to the model to describe the distribution of similarities, which could no longer be derived by means of the C-function. Unfortunately, when only 0-1 values are allowed in the record, the similarity function becomes discrete valued, and so do other quantities, such as the proportions of the desired records within the two clusters, the ratio of these two quantities (which is W), the threshold k, etc. As a result, it is difficult to observe the behavior of a quantity when another integer valued quantity varies. Complicated combinatorics is in use rather than the more powerful calculus. Lemma 3.3, which states that W decreases as k increases, is a good example here. The proof (or disproof) would be easier to come by if $\partial W/\partial k$ were allowed. On the other hand, the optimal values of $\alpha$ and $\beta$ (in RF) would never have been obtained if some of the quantities involved had not been continuous.

We classify the variables used in the models as internal variables or external (input) variables. By internal variables are meant those which are employed to specify the data base with respect to a submitted query, i.e., the distribution of similarities. In RF, all the pairs of $\mu$'s and $\sigma$'s can be classified as internal variables. Their counterparts in RCF are $g_1,\dots,g_\ell$. There are $\ell$ such g's and $\ell$ is dependent on the query submitted. It is very difficult to tell at this point which set of internal variables can be more easily manipulated. However, for RF, some relationships are assumed among the $\mu$'s, although the relative magnitudes of the $\sigma$'s are largely unknown. Ironically, each of the g's must not depend on any others. Otherwise, the distribution of similarities would be much more complicated than it is now, that is to say, if it could be derived at all!

The external variables are those parameters which can somehow be controlled by the system or the user. In RF, they are $\alpha$ and $\beta$. The threshold T is also one of the external variables, but it has no impact on the process because both Q and Q* are required, for the purpose of comparison, to retrieve the same amount of documents. However, there are five input parameters in RCF, i.e., $\ell$, s, t, k and d. The ratio W depends on all of them, as the analysis shows. Moreover, some of these parameters are inter-dependent. As a result, the parameter space is 5-dimensional compared to the 2-dimensional plane in RF. One simply cannot proceed in the same manner as in RF to locate regions that guarantee good results and obtain the optimal values for these parameters. It is suspected that the analysis conducted in chapter 3 has come close to the mathematical limitation imposed by the model.


4.4 Summary 


With only two input parameters and continuous distributions of similarities, RF is a simpler process than RCF. Coupled with the fact that more stringent conditions are imposed on the composition of the data base, it perhaps comes as no surprise that the analysis for RF is more successful and the results are more impressive. Of course, the analysis on RCF is not without merit. It has succeeded in building a model, as no one has before, for this complicated process from which meaningful results can be drawn, based on a minimum number of assumptions. It has also devised means by which the user can exercise more control over the retrieval process. Eventually, it can lead to the implementation of a practical search strategy so that the user can interact with the system to control the number of clusters to be searched.


Appendix I


The Proofs of the Technical Lemmas in Section 2.4